The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

Paper Title

The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

Authors

Annemarie Friedrich, Heike Adel, Federico Tomazic, Johannes Hingerl, Renou Benteau, Anika Maruscyk, Lukas Lange

Abstract

This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.
