Paper Title
Multi-channel Multi-frame ADL-MVDR for Target Speech Separation
Authors
Abstract
Many purely neural-network-based speech separation approaches have been proposed to improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to modern automatic speech recognition (ASR) systems. Minimum variance distortionless response (MVDR) filters are often adopted to remove nonlinear distortions; however, conventional neural mask-based MVDR systems still result in relatively high levels of residual noise. Moreover, the matrix inverse involved in the MVDR solution is sometimes numerically unstable during joint training with neural networks. In this study, we propose a multi-channel multi-frame (MCMF) all deep learning (ADL)-MVDR approach for target speech separation, which extends our preliminary multi-channel ADL-MVDR approach. The proposed MCMF ADL-MVDR system addresses both linear and nonlinear distortions. Spatio-temporal cross-correlations are also fully utilized in the proposed approach. The proposed systems are evaluated using a Mandarin audio-visual corpus and are compared with several state-of-the-art approaches. Experimental results demonstrate the superiority of our proposed systems under different scenarios and across several objective evaluation metrics, including ASR performance.
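For readers unfamiliar with the matrix inverse mentioned in the abstract, the following is a minimal sketch of the conventional (textbook) MVDR solution for a single frequency bin, w = Φ_NN⁻¹ v / (vᴴ Φ_NN⁻¹ v). It is not the proposed ADL-MVDR method. The function name `mvdr_weights`, the `diag_load` parameter, the random noise data, and the toy steering vector are all assumptions made for illustration; diagonal loading is shown only as one common workaround for an ill-conditioned noise covariance matrix.

```python
import numpy as np

def mvdr_weights(phi_nn, steering, diag_load=1e-6):
    """Conventional MVDR weights for one frequency bin (illustrative only).

    phi_nn   : (M, M) Hermitian noise covariance matrix
    steering : (M,) steering vector of the target speaker
    diag_load: hypothetical regularization factor (assumption, not from the paper)
    """
    m = phi_nn.shape[0]
    # Diagonal loading, scaled by the average channel power, keeps the
    # matrix well conditioned before it is (implicitly) inverted.
    loaded = phi_nn + diag_load * (np.trace(phi_nn).real / m) * np.eye(m)
    # Solve a linear system rather than forming the explicit inverse.
    numerator = np.linalg.solve(loaded, steering)   # Phi^{-1} v
    denominator = np.vdot(steering, numerator)      # v^H Phi^{-1} v
    return numerator / denominator

# Toy data: 4 microphones, 200 noise frames (random, for illustration).
rng = np.random.default_rng(0)
M = 4
noise = rng.standard_normal((M, 200)) + 1j * rng.standard_normal((M, 200))
phi_nn = noise @ noise.conj().T / noise.shape[1]
steering = np.exp(-1j * np.pi * np.arange(M) * 0.3)

w = mvdr_weights(phi_nn, steering)
# Distortionless constraint: w^H v should be 1 up to floating-point error.
distortionless_gain = np.vdot(w, steering)
```

The check on `distortionless_gain` reflects the defining constraint of MVDR: the target direction passes with unit gain while the output noise power is minimized. When `phi_nn` is estimated by a neural network and is close to singular, the linear solve is the step that can become unstable, which is the issue the abstract refers to.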