Paper title
Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics
Authors
Abstract
In this paper, we consider the task of retrieving documents on predefined topics from an unlabeled document dataset using an unsupervised approach. The proposed approach requires only a small number of keywords describing the respective topics and no labeled documents. Existing approaches rely heavily either on a large amount of additionally encoded world knowledge or on term-document frequencies. In contrast, we introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset in order to find documents that are semantically similar to the topics described by the keywords. The proposed method requires almost no text preprocessing, yet it retrieves relevant documents with high probability. When successively retrieving documents on different predefined topics from publicly available and commonly used datasets, we achieved an average area under the receiver operating characteristic curve (AUROC) of 0.95 on one dataset and 0.92 on another. Further, our method can be used for multiclass document classification without the need to assign labels to the dataset in advance. Compared with an unsupervised classification baseline, we increased F1 scores from 76.6 to 82.7 and from 61.0 to 75.1 on the respective datasets. For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
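The core idea of the abstract, ranking documents by their similarity to a topic described by a few keywords in a shared embedding space, can be illustrated with a minimal sketch. This is not the released Lbl2Vec code: the two-dimensional vectors are hand-made toy values standing in for learned embeddings, averaging the keyword vectors is an assumed way to form a topic vector, and the names `cosine`, `centroid`, and `rank_documents` are hypothetical helpers introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def rank_documents(keywords, word_vectors, doc_vectors):
    """Rank documents by cosine similarity to the keyword centroid.

    Assumption: words and documents live in the same embedding space,
    so a topic can be represented by the mean of its keyword vectors.
    """
    topic_vector = centroid([word_vectors[k] for k in keywords])
    scores = {
        doc_id: cosine(topic_vector, vec)
        for doc_id, vec in doc_vectors.items()
    }
    # Highest similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, hand-made embeddings (hypothetical values, not learned from data).
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "election": [0.1, 0.9],
}
doc_vectors = {
    "doc_sports": [0.85, 0.15],
    "doc_politics": [0.2, 0.8],
    "doc_mixed": [0.5, 0.5],
}

ranking = rank_documents(["goal", "match"], word_vectors, doc_vectors)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With the keywords "goal" and "match", the toy sports document ranks first and the politics document last. In a real setting the vectors would be learned from the unlabeled corpus, and the similarity scores could be thresholded for retrieval or compared across topics for classification.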