Paper Title
Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition
Paper Authors
Paper Abstract
Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention. Many efforts have been made to turn non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system that uses Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency-controlled, memory-equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. In SCAMA, a jointly trained predictor controls how the encoder output is fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial-scale 20,000-hour Mandarin speech recognition task show that our approach significantly outperforms the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which, to the best of our knowledge, is the best published performance for online ASR.
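The abstract describes a chunk-wise streaming pipeline: the encoder consumes fixed-size chunks of frames to bound latency, and a predictor decides how many output tokens the decoder should emit for each chunk. The following is a minimal, illustrative sketch of that control flow only, not the authors' implementation; the chunk size, the stub encoder/predictor/decoder functions, and the token counts are all hypothetical placeholders standing in for trained LC-SAN-M and SCAMA modules.

```python
# Illustrative chunk-streaming decode loop (assumption-laden sketch, not SCAMA itself).
from typing import Iterator, List

CHUNK_SIZE = 16  # frames per chunk; bounds encoder latency (hypothetical value)

def chunks(frames: List[float], size: int) -> Iterator[List[float]]:
    """Split the incoming frame stream into fixed-size chunks (latency control)."""
    for start in range(0, len(frames), size):
        yield frames[start:start + size]

def encode_chunk(chunk: List[float]) -> List[float]:
    """Stub encoder: a real LC-SAN-M layer would attend over the chunk plus cached memory."""
    return chunk  # identity placeholder

def predict_token_count(encoded: List[float]) -> int:
    """Stub predictor: estimates how many output tokens this chunk contains.
    In SCAMA this is a jointly trained network; here a fixed guess is used."""
    return 1 if encoded else 0

def decode_tokens(encoded: List[float], n_tokens: int) -> List[str]:
    """Stub decoder: emits n_tokens tokens while attending over the encoded chunk."""
    return [f"<tok{i}>" for i in range(n_tokens)]

def streaming_recognize(frames: List[float]) -> List[str]:
    """Emit tokens chunk by chunk instead of waiting for the full utterance."""
    hypothesis: List[str] = []
    for chunk in chunks(frames, CHUNK_SIZE):
        encoded = encode_chunk(chunk)
        n = predict_token_count(encoded)  # predictor gates how much is decoded now
        hypothesis.extend(decode_tokens(encoded, n))
    return hypothesis

if __name__ == "__main__":
    dummy_audio = [0.0] * 80  # 80 dummy frames -> 5 chunks -> 5 streamed tokens
    print(streaming_recognize(dummy_audio))
```

The point of the sketch is the ordering of operations: output becomes available after every chunk, so recognition latency is governed by CHUNK_SIZE rather than by the utterance length.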