Paper Title

OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for Extreme Multi-label Text Classification

Authors

Jie Cao, Yin Zhang

Abstract

Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large-scale label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly predict scores for all labels with a fully connected layer as the last layer of the model. However, such models cannot predict a relatively complete, variable-length label subset for each document, because they select positive labels relevant to the document by a fixed threshold or take the top k labels in descending order of score. A less popular family of deep learning models, called sequence-to-sequence (Seq2Seq), focuses on predicting variable-length positive labels in sequence style. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, and the default order of labels constrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose an autoregressive sequence-to-set model for XMTC tasks named OTSeq2Set. Our model generates predictions in a student-forcing scheme and is trained with a loss function based on bipartite matching, which enables permutation invariance. Meanwhile, we use the optimal transport distance as a measurement to force the model to focus on the closest labels in the semantic label space. Experiments show that OTSeq2Set outperforms other competitive baselines on 4 benchmark datasets. In particular, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.
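The abstract names two training-time ingredients: a bipartite-matching loss that makes the objective permutation-invariant over the label set, and an optimal transport distance in the semantic label space. The sketch below illustrates both ideas in minimal form; it is not the authors' implementation (see their repository for that). The per-slot score matrix, the Hungarian matching via `scipy.optimize.linear_sum_assignment`, and the Sinkhorn approximation of the OT distance are illustrative assumptions.

```python
# Illustrative sketch only, not the OTSeq2Set implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_matching_loss(pred_scores, target_ids):
    """Permutation-invariant loss via bipartite matching.

    pred_scores: (num_slots, num_labels) per-slot label probabilities.
    target_ids:  gold label indices, treated as an unordered set.
    """
    # Cost of assigning each prediction slot to each gold label:
    # the negative probability of that label under that slot.
    cost = -pred_scores[:, target_ids]            # (num_slots, num_targets)
    rows, cols = linear_sum_assignment(cost)      # Hungarian matching
    # Mean negative log-probability over the optimal one-to-one assignment.
    matched = pred_scores[rows, np.asarray(target_ids)[cols]]
    return float(-np.log(matched + 1e-12).mean())

def sinkhorn_distance(C, eps=0.1, n_iter=100):
    """Entropy-regularized OT distance for cost matrix C, uniform marginals."""
    n, m = C.shape
    K = np.exp(-C / eps)                          # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m         # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                       # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = np.diag(u) @ K @ np.diag(v)               # transport plan
    return float((P * C).sum())
```

Because the optimal assignment cost is unchanged when the gold labels are reordered, the matching loss treats the targets as a set rather than a sequence, which is exactly the property the paper claims for its bipartite-matching objective.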
