论文标题

基于目标的深层嵌入向量的分层聚类

Objective-Based Hierarchical Clustering of Deep Embedding Vectors

论文作者

Naumov, Stanislav, Yaroslavtsev, Grigory, Avdiukhin, Dmitrii

论文摘要

我们启动了一项全面的实验研究,对大量数据集的基于目标的分层聚类方法,该方法包括来自计算机视觉和NLP应用程序的深层嵌入向量。其中包括各种各样的图像嵌入(Imagenet,Imagenetv2,nabirds),单词嵌入(Twitter,Wikipedia)和句子嵌入(SST-2)载体(例如Resnet,Resnet,Resnext,Inception v3,sbert,sbert,sbert)。我们的研究包括最高4500万美元条目的数据集,其嵌入式尺寸高达2048美元。 为了解决将层次聚类扩展到如此大的数据集的挑战,我们提出了一种新的实用层次聚类算法B ++&c。流行的Moseley-Wang(MW) / Cohen-Addad等人平均可提高5% / 20%。 (CKMM)目标(归一化)与广泛的经典方法和最近的启发式方法相比。我们还引入了一种理论算法B2SAT&C,该算法在多项式时间内实现了CKMM目标的0.74 $ approximation。这是对随机二进制树实现的微不足道$ 2/3 $ approximation的第一个实质性改进。在这项工作之前,$ \ $ \ 2/3 + 0.0004 $的最佳聚时近似归因于Charikar等人。 (Soda'19)。

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源