Paper Title
On the Learnability of Concepts: With Applications to Comparing Word Embedding Algorithms
Paper Authors
Paper Abstract
Word embeddings are used widely in multiple Natural Language Processing (NLP) applications. They are coordinates associated with each word in a dictionary, inferred from the statistical properties of these words in a large corpus. In this paper we introduce the notion of "concept" as a list of words that have shared semantic content. We use this notion to analyse the learnability of certain concepts, defined as the capability of a classifier to recognise unseen members of a concept after training on a random subset of it. We first use this method to measure the learnability of concepts on pretrained word embeddings. We then develop a statistical analysis of concept learnability, based on hypothesis testing and ROC curves, in order to compare the relative merits of various embedding algorithms using a fixed corpus and fixed hyperparameters. We find that all embedding methods capture the semantic content of those word lists, but fastText performs better than the others.
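The learnability measurement described in the abstract can be illustrated with a minimal sketch: train a classifier on a random subset of a concept's word vectors (against non-member words) and score how well it ranks unseen members via ROC AUC. This is a hypothetical illustration, not the paper's actual code; synthetic vectors stand in for real pretrained embeddings, and the classifier choice (logistic regression) is an assumption.

```python
# Hedged sketch of concept learnability: synthetic vectors stand in for
# pretrained embeddings (a real run would load fastText/word2vec/GloVe).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
dim = 50

# "Concept" members cluster around a shared direction; other words are random.
concept_center = rng.normal(size=dim)
concept_vecs = concept_center + 0.5 * rng.normal(size=(40, dim))
other_vecs = rng.normal(size=(400, dim))

X = np.vstack([concept_vecs, other_vecs])
y = np.array([1] * len(concept_vecs) + [0] * len(other_vecs))

# Train on a random subset, evaluate on the held-out remainder.
idx = rng.permutation(len(X))
train, test = idx[: len(X) // 2], idx[len(X) // 2 :]

clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
scores = clf.predict_proba(X[test])[:, 1]

# Learnability: how well unseen concept members are ranked above non-members.
auc = roc_auc_score(y[test], scores)
print(f"concept learnability (ROC AUC): {auc:.3f}")
```

Repeating this over many random train/test splits (and across embedding algorithms trained on the same corpus) yields the AUC distributions the paper compares via hypothesis testing.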