Cascade Neural Ensemble for Identifying Scientifically Sound Articles

Paper Title

Cascade Neural Ensemble for Identifying Scientifically Sound Articles

Authors

Ashwin Karthik Ambalavanan, Murthy Devarakonda

Abstract

Background: A significant barrier to conducting systematic reviews and meta-analyses is efficiently finding scientifically sound, relevant articles. Typically, fewer than 1% of articles meet this requirement, which makes the task highly imbalanced. Although feature-engineered and early neural network models have been studied for this task, there is room to improve the results.

Methods: We framed article filtering as a classification task, and trained and tested several ensemble architectures of SciBERT, a variant of BERT pre-trained on scientific articles, on a manually annotated dataset of about 50K articles from MEDLINE. Because scientifically sound articles are identified through a multi-step process, we proposed a novel cascade ensemble analogous to that selection process. We compared the cascade ensemble with a single integrated model and other types of ensembles, as well as with results from previous studies.

Results: On a selected subset of the 50K articles, the cascade ensemble achieved an F-measure of 0.7505, a 49.1% error rate reduction relative to a CNN model that was previously proposed and evaluated on that subset. On the full dataset, the cascade ensemble achieved an F-measure of 0.7639, an error rate reduction of 19.7% compared to the best performance reported in a previous study that used the full dataset.

Conclusion: Pre-trained contextual encoder neural networks (e.g., SciBERT) perform better than both previously studied models and manually created search filters at filtering for scientifically sound, relevant articles. The superior performance achieved by the cascade ensemble is a significant result that generalizes beyond this task and dataset, and is analogous to query optimization in IR and databases.
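The cascade idea described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the individual stages, so the stage names, the `Stage` and `cascade_predict` helpers, the 0.5 threshold, and the keyword scorers standing in for fine-tuned SciBERT classifiers are all assumptions made for illustration. The sketch only shows the general pattern in which each stage filters the articles that survived the previous one, so an article is accepted only if it passes every stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A scorer maps an article's text to a probability in [0, 1].
# In practice this would wrap a fine-tuned classifier; here it is a stand-in.
Scorer = Callable[[str], float]


@dataclass
class Stage:
    name: str               # hypothetical label for the criterion checked
    scorer: Scorer          # assumed stand-in for a SciBERT-style classifier
    threshold: float = 0.5  # assumed decision threshold


def cascade_predict(articles: Sequence[str], stages: Sequence[Stage]) -> List[bool]:
    """Return True for each article that passes every stage, in order."""
    surviving = list(range(len(articles)))
    for stage in stages:
        # Only articles that passed all earlier stages are scored here,
        # which is where the analogy to query optimization comes from.
        surviving = [
            i for i in surviving
            if stage.scorer(articles[i]) >= stage.threshold
        ]
    accepted = set(surviving)
    return [i in accepted for i in range(len(articles))]


def keyword_scorer(keyword: str) -> Scorer:
    """Toy scorer: 1.0 if the keyword appears in the text, else 0.0."""
    return lambda text: 1.0 if keyword in text.lower() else 0.0


# Hypothetical two-stage cascade over three toy article titles.
stages = [
    Stage("is_original_study", keyword_scorer("trial")),
    Stage("is_rigorous", keyword_scorer("randomized")),
]
articles = [
    "A randomized controlled trial of drug X",
    "A narrative review of drug X",
    "An open-label trial of drug Y",
]
print(cascade_predict(articles, stages))  # [True, False, False]
```

Placing the cheapest or most selective stage first reduces the number of articles later stages must score, which is one plausible reading of the query-optimization analogy drawn in the Conclusion.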
