基于目标的深层嵌入向量的分层聚类

论文标题

基于目标的深层嵌入向量的分层聚类

Objective-Based Hierarchical Clustering of Deep Embedding Vectors

论文作者

Naumov, Stanislav, Yaroslavtsev, Grigory, Avdiukhin, Dmitrii

论文摘要

我们启动了一项全面的实验研究，对大量数据集的基于目标的分层聚类方法，该方法包括来自计算机视觉和NLP应用程序的深层嵌入向量。其中包括各种各样的图像嵌入（Imagenet，Imagenetv2，nabirds），单词嵌入（Twitter，Wikipedia）和句子嵌入（SST-2）载体（例如Resnet，Resnet，Resnext，Inception v3，sbert，sbert，sbert）。我们的研究包括最高4500万美元条目的数据集，其嵌入式尺寸高达2048美元。为了解决将层次聚类扩展到如此大的数据集的挑战，我们提出了一种新的实用层次聚类算法B ++＆c。流行的Moseley-Wang（MW） / Cohen-Addad等人平均可提高5％ / 20％。（CKMM）目标（归一化）与广泛的经典方法和最近的启发式方法相比。我们还引入了一种理论算法B2SAT＆C，该算法在多项式时间内实现了CKMM目标的0.74 $ approximation。这是对随机二进制树实现的微不足道$ 2/3 $ approximation的第一个实质性改进。在这项工作之前，$ \ $ \ 2/3 + 0.0004 $的最佳聚时近似归因于Charikar等人。（Soda'19）。

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).

下载PDF全文

下载文献需遵守相关版权规定

论文标题