Paper Title

A Call for More Rigor in Unsupervised Cross-lingual Learning

Paper Authors

Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre

Paper Abstract

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world's languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.
