论文标题
基于目标的深层嵌入向量的分层聚类
Objective-Based Hierarchical Clustering of Deep Embedding Vectors
论文作者
论文摘要
我们启动了一项全面的实验研究,对大量数据集的基于目标的分层聚类方法,该方法包括来自计算机视觉和NLP应用程序的深层嵌入向量。其中包括各种各样的图像嵌入(Imagenet,Imagenetv2,nabirds),单词嵌入(Twitter,Wikipedia)和句子嵌入(SST-2)载体(例如Resnet,Resnet,Resnext,Inception v3,sbert,sbert,sbert)。我们的研究包括最高4500万美元条目的数据集,其嵌入式尺寸高达2048美元。 为了解决将层次聚类扩展到如此大的数据集的挑战,我们提出了一种新的实用层次聚类算法B ++&c。流行的Moseley-Wang(MW) / Cohen-Addad等人平均可提高5% / 20%。 (CKMM)目标(归一化)与广泛的经典方法和最近的启发式方法相比。我们还引入了一种理论算法B2SAT&C,该算法在多项式时间内实现了CKMM目标的0.74 $ approximation。这是对随机二进制树实现的微不足道$ 2/3 $ approximation的第一个实质性改进。在这项工作之前,$ \ $ \ 2/3 + 0.0004 $的最佳聚时近似归因于Charikar等人。 (Soda'19)。
We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).