Paper Title

CS-NLP team at SemEval-2020 Task 4: Evaluation of State-of-the-art NLP Deep Learning Architectures on Commonsense Reasoning Task

Paper Authors

Sirwe Saeedi, Aliakbar Panahi, Seyran Saeedi, Alvis C. Fong

Paper Abstract

In this paper, we investigate a commonsense inference task that unifies natural language understanding and commonsense reasoning. We describe our entry in the SemEval-2020 Task 4 competition: the Commonsense Validation and Explanation (ComVE) challenge. We discuss several state-of-the-art deep learning architectures for this challenge. Our system uses prepared labeled textual datasets that were manually curated for three different natural language inference subtasks. The goal of the first subtask is to test whether a model can distinguish between natural language statements that make sense and those that do not. We compare the performance of several language models and fine-tuned classifiers. We then propose a method, inspired by question answering tasks, that treats the classification problem as a multiple-choice task, boosting our experimental results to 96.06%, significantly better than the baseline. For the second subtask, selecting the reason why a statement does not make sense, we placed within the first six of 27 participating teams (93.7%) with very competitive results. Our result for the last subtask, generating a reason explaining why a statement is nonsensical, shows much potential for future research: applying the most powerful generative language model (GPT-2), we placed among the first four teams with a BLEU score of 6.1732.
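The multiple-choice reframing of the validation subtask can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each candidate statement has already been scored by a language model (e.g. by perplexity, where lower means more plausible), and the function name and the score values below are hypothetical.

```python
def choose_sensible(statements, perplexities):
    """Subtask A as multiple choice: given candidate statements and their
    (hypothetical) language-model perplexities, return the index of the
    statement the model finds most plausible (lowest perplexity)."""
    return min(range(len(statements)), key=lambda i: perplexities[i])

# Illustrative example with made-up perplexity values:
pair = ["He put a turkey into the fridge.",
        "He put an elephant into the fridge."]
scores = [12.4, 87.9]  # hypothetical LM perplexities, not real outputs
print(pair[choose_sensible(pair, scores)])  # prints the first statement
```

Framing the two statements jointly as one multiple-choice question, rather than classifying each in isolation, lets the model compare the pair directly, which the abstract reports as the source of the performance boost.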
