Paper Title

Multimodal Transformer Distillation for Audio-Visual Synchronization

Paper Authors

Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-yi Lee, Jyh-Shing Roger Jang

Paper Abstract

Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes MTDVocaLiST, a model trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention distributions and value relations in the Transformer of VocaLiST. Additionally, we harness uncertainty weighting to fully exploit the interaction information across all layers. Our proposed method is effective in two respects. From the distillation-method perspective, the MTD loss outperforms other strong distillation baselines. From the distilled model's performance perspective: 1) MTDVocaLiST outperforms two similarly sized SOTA models, SyncNet and Perfect Match, by 15.65% and 3.35%, respectively; 2) MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining similar performance.
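
To make the distillation objective concrete, below is a minimal PyTorch sketch of an MTD-style loss under two assumptions: the value relation is a MiniLM-style scaled dot-product self-similarity of the value states, and the per-layer losses are combined with Kendall-style learnable uncertainty weights (exp(-s_l) * L_l + s_l). The names MTDLoss, attn_kl, and value_relation, the tensor shapes, and the hook-based capture of attention/value states are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def attn_kl(teacher_probs: torch.Tensor, student_probs: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing the student's attention distribution toward the
    teacher's. Both tensors: (batch, heads, query_len, key_len), already
    softmax-normalized over the last dimension."""
    return F.kl_div(student_probs.clamp_min(1e-8).log(), teacher_probs,
                    reduction="batchmean")


def value_relation(v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-similarity of the value states (MiniLM-style).
    v: (batch, heads, seq_len, head_dim) -> (batch, heads, seq_len, seq_len)."""
    return F.softmax(v @ v.transpose(-1, -2) / v.size(-1) ** 0.5, dim=-1)


class MTDLoss(nn.Module):
    """Illustrative multimodal Transformer distillation loss: per
    cross-attention layer, the student mimics the teacher's attention
    distribution and value relation, and the per-layer losses are combined
    with learnable uncertainty weights."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_layers))  # one s_l per layer

    def forward(self, teacher_layers, student_layers):
        # Each element is a dict holding that layer's post-softmax
        # cross-attention weights ("attn") and value states ("value"),
        # captured e.g. with forward hooks on teacher and student.
        total = 0.0
        for s, t_layer, s_layer in zip(self.log_vars, teacher_layers, student_layers):
            layer_loss = (attn_kl(t_layer["attn"], s_layer["attn"])
                          + attn_kl(value_relation(t_layer["value"]),
                                    value_relation(s_layer["value"])))
            total = total + torch.exp(-s) * layer_loss + s
        return total
```

In training, this term would typically be added to the student's task loss; the learnable log-variances let the optimizer re-weight noisier layers automatically instead of relying on hand-tuned per-layer coefficients.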
