Paper Title
BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model
Paper Authors
Paper Abstract
This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial towards downstream spoken language understanding tasks.
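The abstract names two inference-time ingredients: CTC decoding and a mask-predict algorithm that iteratively refines the output sequence. The sketch below is a minimal, self-contained illustration of how those two pieces can be combined, not the authors' implementation: `toy_posteriors`, the reserved `BLANK`/`MASK` ids, and the linear masking schedule are all illustrative assumptions.

```python
# Minimal sketch of CTC greedy decoding followed by mask-predict refinement.
# A toy random "model" stands in for BERT-CTC; all names here are assumptions.
import numpy as np

BLANK, MASK = 0, 1  # reserved vocabulary ids (assumed layout)


def ctc_greedy_decode(frame_logits, blank=BLANK):
    """Frame-wise argmax, then standard CTC collapse:
    merge repeated labels and drop blanks."""
    path = frame_logits.argmax(axis=-1)
    out, prev = [], None
    for t in path:
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out


def mask_predict_refine(posteriors_fn, tokens, n_iters=4):
    """Iteratively re-mask the least-confident positions and re-predict
    them conditioned on the rest of the partially observed sequence."""
    tokens, length = list(tokens), len(tokens)
    for it in range(n_iters):
        probs = posteriors_fn(tokens)           # (length, vocab) posteriors
        conf, preds = probs.max(-1), probs.argmax(-1)
        tokens = [int(p) for p in preds]
        # Linearly decaying mask ratio: re-mask fewer positions each pass,
        # ending with a fully predicted sequence on the final iteration.
        n_mask = int(length * (n_iters - 1 - it) / n_iters)
        for i in np.argsort(conf)[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, n_frames = 30, 50

    # Stand-in for the acoustic/CTC branch: random frame-level logits.
    frame_logits = rng.normal(size=(n_frames, vocab))
    hyp = ctc_greedy_decode(frame_logits)
    print("initial CTC hypothesis:", hyp)

    # Stand-in for token posteriors conditioned on the masked hypothesis.
    def toy_posteriors(tokens):
        logits = rng.normal(size=(len(tokens), vocab))
        e = np.exp(logits - logits.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    print("refined hypothesis:   ", mask_predict_refine(toy_posteriors, hyp))
```

The decaying mask ratio follows the generic Mask-Predict recipe (Ghazvininejad et al., 2019). In BERT-CTC itself, the token posteriors would come from CTC outputs conditioned on BERT contextual embeddings of the partially masked hypothesis rather than from a standalone masked LM, as described in the abstract.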