Lexical Normalization for Code-switched Data and its Effect on POS-tagging

Paper title

Lexical Normalization for Code-switched Data and its Effect on POS-tagging

Authors

Rob van der Goot, Özlem Çetinoğlu

Abstract

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging compared to unnormalized input.
