Paper Title
A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives
Paper Authors
Abstract
We analyze two Natural Language Inference data sets with respect to their linguistic features. The goal is to identify the syntactic and semantic properties that are particularly hard for a machine learning model to comprehend. To this end, we also investigate the differences between a crowd-sourced, machine-translated data set (SNLI) and a collection of text pairs from internet sources. Our main finding is that the model has difficulty recognizing the semantic importance of prepositions and verbs, which emphasizes the importance of linguistically aware pre-training tasks. Furthermore, it often does not comprehend antonyms and homonyms, especially when their meaning depends on the context. Incomplete sentences are another problem, as are longer paragraphs and rare words or phrases. The study shows that automated language understanding requires a more informed approach, utilizing as much external knowledge as possible throughout the training process.