Paper Title

EPIE Dataset: A Corpus For Possible Idiomatic Expressions

Authors

Prateek Saxena, Soma Paul

Abstract

Idiomatic expressions have always been a bottleneck for language comprehension and natural language understanding, specifically for tasks like Machine Translation (MT). MT systems predominantly produce literal translations of idiomatic expressions, because these expressions do not exhibit generic and linguistically deterministic patterns that can be exploited to recover their non-compositional meaning. Such expressions do occur in the parallel corpora used for training, but the constituent words of an idiomatic expression occur far more often in literal contexts, so the idiomatic meaning gets overpowered by the compositional meaning of the expression. State-of-the-art Metaphor Detection Systems are able to detect non-compositional usage at the word level but miss idiosyncratic phrasal idiomatic expressions. This creates a dire need for a dataset with wider coverage and a higher occurrence of commonly occurring idiomatic expressions, the spans of which can be used for Metaphor Detection. With this in mind, we present our English Possible Idiomatic Expressions (EPIE) corpus, containing 25206 sentences labelled with lexical instances of 717 idiomatic expressions. These spans also cover literal usages for the given set of idiomatic expressions. We also demonstrate the utility of our dataset by using it to train a sequence labelling module and testing it on three independent datasets, with high accuracy, precision and recall scores.
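
The abstract describes sentences labelled with idiom spans and a sequence labelling module trained on them. As a rough illustration only, the sketch below shows one common way such span labels can be encoded per token, using a BIO scheme. The tag names (`B-IDIOM`, `I-IDIOM`, `O`), the helper `bio_tags`, and the example sentence are assumptions made for this sketch, not details taken from the paper or the released corpus.

```python
from typing import List


def bio_tags(tokens: List[str], idiom: List[str]) -> List[str]:
    """Return BIO tags marking the first exact match of `idiom` in `tokens`.

    Hypothetical helper: matching is case-insensitive and token-exact,
    so inflected or discontinuous idiom instances are NOT found.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in idiom]
    n = len(target)
    if n == 0:
        return tags
    for start in range(len(lowered) - n + 1):
        if lowered[start:start + n] == target:
            tags[start] = "B-IDIOM"          # first token of the span
            for i in range(start + 1, start + n):
                tags[i] = "I-IDIOM"          # remaining tokens of the span
            break
    return tags


# Assumed example sentence and idiom, chosen only for illustration.
tokens = "She let the cat out of the bag yesterday".split()
idiom = "let the cat out of the bag".split()
tags = bio_tags(tokens, idiom)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that a span tagged this way marks a possible idiomatic expression; as the abstract states, the labelled spans also cover literal usages, so the tags alone do not say whether a given instance is idiomatic.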
