Paper Title

Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

Paper Authors

Ercong Nie, Sheng Liang, Helmut Schmid, Hinrich Schütze

Abstract

Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
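
To make the PARC pipeline concrete, below is a minimal, hypothetical sketch of its core idea: for a low-resource-language (LRL) input, retrieve semantically similar high-resource-language (HRL) sentences with a multilingual encoder and prepend them, with their labels, as an in-context prompt (the labeled setting). This is not the authors' code; the `sentence-transformers` library, the model name, the toy HRL pool, and the prompt template are all illustrative assumptions.

```python
# Hypothetical PARC-style sketch (labeled setting), not the paper's implementation.
from sentence_transformers import SentenceTransformer, util

# Multilingual sentence encoder for cross-lingual retrieval (assumed model choice).
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy labeled English (HRL) pool for binary sentiment classification.
hrl_pool = [
    ("The movie was wonderful and moving.", "positive"),
    ("A dull, poorly acted film.", "negative"),
]
hrl_embeddings = encoder.encode([s for s, _ in hrl_pool], convert_to_tensor=True)

def build_parc_prompt(lrl_sentence: str, k: int = 1) -> str:
    """Retrieve the k most similar HRL sentences and build a cloze-style prompt."""
    query = encoder.encode(lrl_sentence, convert_to_tensor=True)
    hits = util.semantic_search(query, hrl_embeddings, top_k=k)[0]
    # Retrieved labeled HRL examples become in-context demonstrations.
    context = "\n".join(
        f"{hrl_pool[hit['corpus_id']][0]} Sentiment: {hrl_pool[hit['corpus_id']][1]}."
        for hit in hits
    )
    # The MPLM is then asked to complete the label slot for the LRL input.
    return f"{context}\n{lrl_sentence} Sentiment:"

# Example query in Swahili ("The movie was very good.").
print(build_parc_prompt("Filamu ilikuwa nzuri sana."))
```

In the unlabeled setting described in the abstract, the retrieved HRL sentences would carry predicted rather than gold labels; the prompt construction is otherwise the same.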
