Paper Title
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders
Paper Authors
Paper Abstract
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration-allocated Tacotron-like acoustic models and WaveRNN neural vocoders. Unlike conventional SVS models, the proposed ByteSing employs Tacotron-like encoder-decoder structures as its acoustic models, in which CBHG models and recurrent neural networks (RNNs) are explored as encoders and decoders, respectively. Meanwhile, an auxiliary phoneme duration prediction model is utilized to expand the input sequence, which enhances model controllability, stability, and tempo prediction accuracy. A WaveRNN neural vocoder is also adopted to further improve the voice quality of the synthesized songs. Both objective and subjective experimental results demonstrate that the proposed SVS method can produce quite natural, expressive, and high-fidelity songs by improving pitch and spectrogram prediction accuracy, and that the models using an attention mechanism achieve the best performance.
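The core mechanism the abstract describes is expanding the phoneme-level input sequence to frame level using predicted durations before decoding. Below is a minimal sketch of such duration-based expansion, assuming a PyTorch implementation; the function name, tensor shapes, and example values are illustrative assumptions, not details from the paper.

```python
import torch

def expand_by_duration(encoder_outputs: torch.Tensor,
                       durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme-level encoding by its predicted frame count.

    encoder_outputs: (num_phonemes, hidden_dim) phoneme-level features
    durations:       (num_phonemes,) integer frame count per phoneme,
                     e.g. produced by an auxiliary duration prediction model
    Returns frame-level features of shape (sum(durations), hidden_dim),
    which a frame-level decoder could then consume.
    """
    return torch.repeat_interleave(encoder_outputs, durations, dim=0)

# Hypothetical example: 3 phonemes with 4-dim encodings,
# predicted to last 2, 1, and 3 frames respectively.
enc = torch.randn(3, 4)
dur = torch.tensor([2, 1, 3])
frames = expand_by_duration(enc, dur)
print(frames.shape)  # torch.Size([6, 4])
```

Expanding the sequence this way gives the decoder an explicit, controllable alignment between phonemes and output frames, which is consistent with the abstract's claim of improved stability and tempo prediction compared with learning the alignment implicitly.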