Title
Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
Authors
Abstract
The style of speech varies from person to person: every speaker exhibits a personal style determined by language, geography, culture and other factors. Style is best captured by the prosody of the signal. High-quality multi-speaker speech synthesis that accounts for prosody in a few-shot manner is an area of active research with many real-world applications. While multiple efforts have been made in this direction, it remains an interesting and challenging problem. In this paper, we present a novel few-shot multi-speaker speech synthesis approach (FSM-SS) that leverages an adaptive normalization architecture with a non-autoregressive multi-head attention model. Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner. Additionally, we demonstrate how the affine parameters of normalization help capture prosodic features such as energy and fundamental frequency in a disentangled fashion and can be used to generate morphed speech output. We demonstrate the efficacy of our proposed architecture on the multi-speaker VCTK and LibriTTS datasets using multiple quantitative metrics that measure generated speech distortion and MOS, along with speaker embedding analysis of the generated speech vs. the actual speech samples.
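The abstract's central mechanism is conditioning normalization layers on a reference sample: the affine scale and shift are predicted from a reference embedding rather than learned as fixed parameters. A minimal sketch of this adaptive-normalization idea follows; all names and the embedding-to-affine projection are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adaptive_norm(x, ref_embedding, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """Adaptive normalization sketch (hypothetical, not the paper's code).

    Normalizes features x along the last axis, then applies an affine
    scale (gamma) and shift (beta) predicted from a reference embedding,
    e.g. one derived from an unseen speaker's reference utterance.
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    x_norm = (x - mean) / (std + eps)
    # Affine parameters are functions of the reference, so prosodic
    # attributes of the reference modulate the normalized features.
    gamma = ref_embedding @ W_gamma + b_gamma   # predicted scale
    beta = ref_embedding @ W_beta + b_beta      # predicted shift
    return gamma * x_norm + beta

# Tiny demo: with a projection producing gamma=1, beta=0, the output is
# simply the normalized features (zero mean along the last axis).
x = np.array([[1.0, 2.0, 3.0, 4.0]])
ref = np.array([[1.0, 0.0]])                    # toy reference embedding
W_g = np.zeros((2, 1)); b_g = np.ones(1)        # gamma -> 1
W_b = np.zeros((2, 1)); b_b = np.zeros(1)       # beta  -> 0
out = adaptive_norm(x, ref, W_g, b_g, W_b, b_b)
```

Because gamma and beta depend only on the reference embedding, swapping or interpolating reference samples changes the affine parameters alone, which is one plausible route to the disentangled, morphed outputs the abstract describes.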