Title
vec2text with Round-Trip Translations
Authors
Abstract
We investigate models that can generate arbitrary natural language text (e.g. all English sentences) from a bounded, convex, and well-behaved control space. We call them universal vec2text models. Such models would allow making semantic decisions in the vector space (e.g. via reinforcement learning) while the natural language generation is handled by the vec2text model. We propose four desired properties that such vec2text models should possess: universality, diversity, fluency, and semantic structure; and we provide quantitative and qualitative methods to assess them. We implement a vec2text model by adding a bottleneck to a 250M-parameter Transformer model and training it with an auto-encoding objective on 400M sentences (10B tokens) extracted from a massive web corpus. We propose a simple data augmentation technique based on round-trip translations and show in extensive experiments that the resulting vec2text model surprisingly leads to vector spaces that fulfill our four desired properties, and that this model strongly outperforms both standard and denoising auto-encoders.
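The round-trip translation augmentation mentioned above can be illustrated with a minimal sketch. The translation functions below are toy stand-ins (hypothetical, not from the paper); in practice a real machine translation system would translate each sentence to a pivot language and back, producing a paraphrase that preserves meaning while varying the surface form.

```python
# Minimal sketch of round-trip translation augmentation.
# The two translate functions are toy lookup-table stand-ins for a
# real NMT system (an assumption for illustration only); the paper
# uses actual translation models over a large web corpus.

def translate_en_to_de(sentence: str) -> str:
    # Toy stand-in for an English -> German translation model.
    toy = {"the cat sat on the mat": "die Katze sass auf der Matte"}
    return toy[sentence]

def translate_de_to_en(sentence: str) -> str:
    # Toy stand-in for the reverse direction; round-tripping
    # typically yields a paraphrase, not the exact original.
    toy = {"die Katze sass auf der Matte": "the cat was sitting on the mat"}
    return toy[sentence]

def round_trip(sentence: str) -> str:
    """Translate to a pivot language and back to get a paraphrase."""
    return translate_de_to_en(translate_en_to_de(sentence))

def augment(corpus):
    # Pair each sentence with its round-trip paraphrase, so the
    # auto-encoder sees two surface forms sharing one meaning.
    return [(s, round_trip(s)) for s in corpus]

pairs = augment(["the cat sat on the mat"])
```

Training an auto-encoder on such pairs encourages the bottleneck vector to encode meaning rather than exact wording, which is one plausible reading of why this augmentation improves the semantic structure of the vector space.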