Paper Title

Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription

Authors

Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

Abstract

While end-to-end ASR systems have proven competitive with the conventional hybrid approach, they are prone to accuracy degradation when it comes to noisy and low-resource conditions. In this paper, we argue that, even in such difficult cases, some end-to-end approaches show performance close to the hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an example of challenging environments and noisy conditions of everyday speech. We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures. We also provide a comparison of acoustic features and speech enhancements. Besides, we evaluate the effectiveness of neural network language models for hypothesis re-scoring in low-resource conditions. Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality by only 3.8% WER abs. worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline. With the Guided Source Separation based training data augmentation, this approach outperforms the hybrid baseline system by 2.7% WER abs. and the end-to-end system best known before by 25.7% WER abs.
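To make the RNN-Transducer setup discussed in the abstract more concrete, below is a minimal PyTorch sketch of the three transducer components (acoustic encoder, prediction network, joint network) trained with a transducer loss. The model sizes, vocabulary, blank index, and the use of torchaudio's rnnt_loss are illustrative assumptions; this is not the authors' implementation or the CHiME-6 recipe.

```python
# Minimal RNN-Transducer sketch (illustrative only; dimensions, vocabulary,
# blank index, and the torchaudio loss are assumptions, not the paper's setup).
import torch
import torch.nn as nn
import torchaudio  # provides torchaudio.functional.rnnt_loss (torchaudio >= 0.10)

class TinyTransducer(nn.Module):
    def __init__(self, n_feats=80, vocab=100, hidden=256, blank=0):
        super().__init__()
        self.blank = blank
        # Acoustic encoder: consumes filterbank frames.
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        # Prediction network: consumes previously emitted labels.
        self.embed = nn.Embedding(vocab, hidden)
        self.pred = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # Joint network: combines encoder and prediction outputs per (t, u) pair.
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab))

    def forward(self, feats, labels):
        enc, _ = self.encoder(feats)                       # (B, T, H)
        # Prepend the blank symbol so the prediction net has a "start" input.
        start = torch.full((labels.size(0), 1), self.blank,
                           dtype=labels.dtype, device=labels.device)
        pred_in = self.embed(torch.cat([start, labels], dim=1).long())
        pred, _ = self.pred(pred_in)                       # (B, U+1, H)
        # Broadcast-combine every encoder frame with every prediction step.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint_in)                        # (B, T, U+1, V)

# Toy usage with random data.
model = TinyTransducer()
feats = torch.randn(2, 50, 80)                             # (batch, frames, fbank dims)
labels = torch.randint(1, 100, (2, 10), dtype=torch.int32)  # labels avoid blank=0
logits = model(feats, labels)
loss = torchaudio.functional.rnnt_loss(
    logits, labels,
    logit_lengths=torch.tensor([50, 50], dtype=torch.int32),
    target_lengths=torch.tensor([10, 10], dtype=torch.int32),
    blank=0)
loss.backward()
```

At inference time, such a model is decoded with a beam search over the joint network outputs (the paper reports an improved beam search), and the resulting n-best hypotheses can be re-scored with a neural network language model, as described in the abstract.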
