Paper Title

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Paper Authors

Hegde, Sindhu B.; Prajwal, K. R.; Mukhopadhyay, Rudrabha; Namboodiri, Vinay P.; Jawahar, C. V.

Paper Abstract

In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, with the key one being that many features of the desired target speech, like voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baselines by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve a performance comparable to single-speaker models that are trained on $4\times$ more data. We conduct numerous ablation studies to analyze the effect of different modules of our architecture. We also provide a demo video that demonstrates several qualitative results along with the code and trained models on our website: \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/lip-to-speech-synthesis}
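The abstract describes a VAE-GAN generator, guided by multiple discriminators, that maps silent lip sequences to speech. The PyTorch sketch below is a minimal, hypothetical illustration of that idea only: a variational lip encoder, a mel-spectrogram decoder, and a single adversarial discriminator. All module names, dimensions, and loss weights are assumptions for illustration and do not reflect the authors' actual implementation.

```python
# Minimal, illustrative VAE-GAN style lip-to-speech sketch (not the paper's architecture).
import torch
import torch.nn as nn

class LipEncoder(nn.Module):
    """Encodes a silent lip-feature sequence into a latent distribution (VAE part)."""
    def __init__(self, frame_dim=512, latent_dim=128):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, 256, batch_first=True)
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, lip_feats):                              # (B, T, frame_dim)
        h, _ = self.rnn(lip_feats)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

class SpeechDecoder(nn.Module):
    """Decodes the latent sequence into a mel-spectrogram (generator output)."""
    def __init__(self, latent_dim=128, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, mel_dim))

    def forward(self, z):                                      # (B, T, latent_dim)
        return self.net(z)                                     # (B, T, mel_dim)

class MelDiscriminator(nn.Module):
    """One of possibly several discriminators scoring real vs. generated speech."""
    def __init__(self, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, mel):
        return self.net(mel).mean(dim=1)                       # one logit per sequence

if __name__ == "__main__":
    # Hypothetical single generator step: reconstruction + KL + adversarial terms.
    enc, dec, disc = LipEncoder(), SpeechDecoder(), MelDiscriminator()
    lip_feats = torch.randn(4, 50, 512)                        # dummy lip features (B, T, D)
    real_mel = torch.randn(4, 50, 80)                          # dummy target mel-spectrogram

    z, mu, logvar = enc(lip_feats)
    fake_mel = dec(z)

    rec_loss = nn.functional.l1_loss(fake_mel, real_mel)
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    adv_loss = nn.functional.binary_cross_entropy_with_logits(
        disc(fake_mel), torch.ones(4, 1))
    gen_loss = rec_loss + 1e-3 * kl_loss + 1e-2 * adv_loss     # assumed loss weights
    print(float(gen_loss))
```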
