Paper Title
Scalable Multilingual Frontend for TTS
Paper Authors
Paper Abstract
This paper describes progress towards making a Neural Text-to-Speech (TTS) Frontend that works for many languages and can be easily extended to new languages. We take a Machine Translation (MT)-inspired approach to constructing the frontend, and model both text normalization and pronunciation at the sentence level by building and using sequence-to-sequence (S2S) models. We experimented with training normalization and pronunciation as separate S2S models and with training a single S2S model combining both functions. For our language-independent approach to pronunciation we do not use a lexicon. Instead, all pronunciations, including context-based pronunciations, are captured in the S2S model. We also present a language-independent chunking and splicing technique that allows us to process sentences of arbitrary length. Models for 18 languages were trained and evaluated. Many of the accuracy measurements are above 99%. We also evaluated the models in the context of end-to-end synthesis against our current production system.
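The chunking and splicing technique is only named in the abstract, not specified. Below is a minimal Python sketch of one plausible realization: split a long token sequence into fixed-size overlapping chunks, run the S2S model on each chunk, and splice the outputs by discarding the overlapping prefix of every chunk after the first. The chunk size, overlap, and the `s2s_model` stub are illustrative assumptions, not the authors' implementation, and the splice logic assumes the model emits one output token per input token.

```python
# Hedged sketch of language-independent chunking and splicing for
# arbitrary-length sentences. Not the paper's algorithm: chunk size,
# overlap, and the toy model below are illustrative assumptions.

from typing import Callable, List


def chunk_and_splice(
    tokens: List[str],
    s2s_model: Callable[[List[str]], List[str]],
    max_len: int = 20,
    overlap: int = 4,
) -> List[str]:
    """Run an S2S model over overlapping chunks and splice the outputs.

    Assumes the model produces one output token per input token, so the
    overlapping region can be discarded deterministically when splicing.
    """
    if len(tokens) <= max_len:
        return s2s_model(tokens)

    step = max_len - overlap
    output: List[str] = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + max_len]
        result = s2s_model(chunk)
        # Keep everything from the first chunk; for later chunks, drop the
        # overlap tokens already emitted by the previous chunk.
        keep_from = 0 if start == 0 else overlap
        output.extend(result[keep_from:])
        if start + max_len >= len(tokens):
            break
    return output


# Toy stand-in for a trained S2S normalization/pronunciation model.
uppercase_model = lambda chunk: [t.upper() for t in chunk]

sentence = ("this is a deliberately long sentence used to exercise the "
            "chunking and splicing path of the frontend sketch").split()
print(chunk_and_splice(sentence, uppercase_model))
```

The overlap gives the model left context at each chunk boundary, so the tokens decoded without full context sit inside the discarded region rather than in the spliced output; a real system would need a splicing rule robust to output sequences whose length differs from the input.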