Paper Title
Attention Enhanced Citrinet for Speech Recognition
Paper Authors
Paper Abstract
Citrinet is an end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. To capture local and global contextual information, Citrinet uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation (SE), making the whole architecture deep: 23 blocks with 235 convolution layers and 46 linear layers. This purely convolutional and deep architecture makes Citrinet relatively slow to converge. In this paper, we propose to introduce multi-head attention together with feed-forward networks into the convolution module of the Citrinet blocks, while keeping the SE and residual modules unchanged. To speed up training, we remove 8 convolution layers from each attention-enhanced Citrinet block and reduce the number of blocks from 23 to 13. Experiments on the Japanese CSJ-500h and Magic-1600h datasets show that the attention-enhanced Citrinet, with fewer layers and blocks, converges faster and reaches lower character error rates than (1) Citrinet with 80\% of the training time and (2) Conformer with 40\% of the training time and 29.8\% of the model size.
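The abstract describes inserting multi-head attention and a feed-forward network into the convolution module of each Citrinet block while keeping the SE and residual paths. The PyTorch sketch below illustrates one plausible reading of that block structure; all dimensions, layer counts, and class names (e.g. `SqueezeExcite1d`, `AttentionEnhancedCitrinetBlock`) are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of one attention-enhanced Citrinet block (assumptions, not the paper's code).
import torch
import torch.nn as nn


class SqueezeExcite1d(nn.Module):
    """Squeeze-and-excitation over the channel dimension of a (B, C, T) tensor."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.fc(x.mean(dim=-1))  # global average pool over time
        return x * scale.unsqueeze(-1)   # re-weight channels


class AttentionEnhancedCitrinetBlock(nn.Module):
    """Time-channel separable conv -> multi-head attention + FFN -> SE,
    wrapped in a residual connection (SE and residual kept as in Citrinet)."""

    def __init__(self, channels: int = 256, kernel_size: int = 11, num_heads: int = 4):
        super().__init__()
        # 1D time-channel separable convolution = depthwise conv + pointwise conv.
        self.separable_conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels),  # depthwise (time)
            nn.Conv1d(channels, channels, kernel_size=1),          # pointwise (channel)
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
        )
        # Multi-head self-attention and feed-forward network added to the conv module.
        self.attn_norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(channels),
            nn.Linear(channels, 4 * channels),
            nn.ReLU(inplace=True),
            nn.Linear(4 * channels, channels),
        )
        self.se = SqueezeExcite1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        residual = x
        y = self.separable_conv(x)
        # Attention and FFN operate on (batch, time, channels).
        y = y.transpose(1, 2)
        q = self.attn_norm(y)
        attn_out, _ = self.attn(q, q, q)
        y = y + attn_out
        y = y + self.ffn(y)
        y = y.transpose(1, 2)
        y = self.se(y)
        return y + residual  # residual connection unchanged


if __name__ == "__main__":
    block = AttentionEnhancedCitrinetBlock()
    feats = torch.randn(2, 256, 100)   # (batch, channels, frames)
    print(block(feats).shape)          # torch.Size([2, 256, 100])
```

In this reading, the attention and feed-forward sub-layers supply global context so that some of Citrinet's stacked convolutions (and blocks) can be dropped, which is consistent with the reported reduction from 23 to 13 blocks and the faster convergence.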