Paper title
XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
Authors
Abstract
This paper presents XiaoiceSing, a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling. We follow the main architecture of FastSpeech while proposing several singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the durations of all the phonemes in a musical note are accumulated to calculate a syllable duration loss for rhythm enhancement. Experimental results show that XiaoiceSing outperforms a convolutional neural network baseline by 1.44 MOS on sound quality, 1.18 on pronunciation accuracy, and 1.38 on naturalness. In two A/B tests, the proposed F0 and duration modeling methods achieve preference rates of 97.3% and 84.3% over the baseline, respectively, which demonstrates the overwhelming advantages of XiaoiceSing.
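Design points 2) and 3) can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption: the residual connection is read as "the network predicts an offset that is added to the score's note pitch in the log-F0 domain", both losses are taken to be mean squared errors, and the names `residual_f0`, `duration_loss`, and `syllable_weight` are hypothetical. This is not the authors' implementation.

```python
import math


def midi_to_log_f0(midi_note):
    """Convert a MIDI note number to log F0 (natural log of Hz); A4 = 69 = 440 Hz."""
    return math.log(440.0 * 2.0 ** ((midi_note - 69) / 12.0))


def residual_f0(note_pitches, predicted_residuals):
    """Assumed residual connection: output log F0 = score pitch + predicted offset.

    The model then only has to learn small deviations (vibrato, transitions)
    around the written pitch rather than the absolute F0 contour.
    """
    return [midi_to_log_f0(p) + r for p, r in zip(note_pitches, predicted_residuals)]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def accumulate_by_note(durations, note_ids):
    """Sum the durations of phonemes that belong to the same note, in note order."""
    totals = {}
    for d, n in zip(durations, note_ids):
        totals[n] = totals.get(n, 0.0) + d
    return list(totals.values())


def duration_loss(pred_dur, true_dur, note_ids, syllable_weight=1.0):
    """Phoneme-level loss plus a note-level (syllable) loss on accumulated durations.

    syllable_weight is a hypothetical balancing factor, not taken from the paper.
    """
    phoneme_loss = mse(pred_dur, true_dur)
    syllable_loss = mse(
        accumulate_by_note(pred_dur, note_ids),
        accumulate_by_note(true_dur, note_ids),
    )
    return phoneme_loss + syllable_weight * syllable_loss
```

Under these assumptions, two phonemes in one note predicted as 2 and 4 frames against a target of 3 and 3 frames incur a phoneme-level penalty but no syllable-level penalty, because the note's total length is still correct. Predictions that change the note's total length are penalized by both terms, which is the rhythm-preserving effect the abstract describes.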