Paper Title
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
Authors
Abstract
For endangered languages, data collection campaigns have to accommodate the challenge that many of these languages come from oral traditions, and producing transcriptions is costly. It is therefore essential to translate the recordings into a widely spoken language to ensure that they remain interpretable. In this paper, we investigate how the choice of translation language affects subsequent documentation work and the automatic approaches that may operate on the resulting bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs, which we apply to the task of low-resource unsupervised word segmentation and alignment. Our results highlight that the choice of translation language influences word segmentation performance, and that different lexicons are learned when different aligned translations are used. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models' input representation increases their translation and alignment quality, especially for challenging language pairs.
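As a rough illustration of two ideas in the abstract, the sketch below first enumerates directed (unsegmented language, translation language) pairs, then shows one possible way to expose externally proposed word boundaries to a sequence model by inserting a marker token into its input. The language codes, the function names, and the `<b>` marker are assumptions made for this example; the abstract states only the count of 56 pairs and that boundary clues are incorporated into the input representation, not how.

```python
from itertools import permutations

# Assumption: ISO codes for eight languages; the abstract gives only the
# pair count (56), which matches 8 * 7 ordered pairs of distinct languages.
LANGUAGES = ["eu", "en", "fi", "fr", "hu", "ro", "ru", "es"]


def directed_pairs(languages):
    """Return every ordered pair of distinct languages.

    The first element plays the role of the unsegmented language, the
    second the aligned translation language.
    """
    return list(permutations(languages, 2))


def inject_boundary_clues(symbols, boundaries, marker="<b>"):
    """Insert a marker token after each proposed boundary position.

    symbols:    list of unsegmented units (for example, phonemes)
    boundaries: set of indices i meaning "a boundary is proposed after
                symbols[i]" (hypothetical output of a separate segmenter)
    marker:     hypothetical special token added to the model input
    """
    augmented = []
    last = len(symbols) - 1
    for i, symbol in enumerate(symbols):
        augmented.append(symbol)
        # Skip the sequence-final position: the end of the utterance is
        # already a boundary and carries no extra information.
        if i in boundaries and i < last:
            augmented.append(marker)
    return augmented


if __name__ == "__main__":
    pairs = directed_pairs(LANGUAGES)
    print(len(pairs))
    print(inject_boundary_clues(list("thedog"), {2}))
```

With this toy input, the marker lands between "the" and "dog"; a downstream model would then see the proposed boundary as an extra input token while remaining free to ignore it.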