Paper Title

Decorrelated Clustering with Data Selection Bias

Authors

Xiao Wang, Shaohua Fan, Kun Kuang, Chuan Shi, Jiawei Liu, Bai Wang

Abstract

Most existing clustering algorithms are proposed without considering the selection bias in data. In many real applications, however, one cannot guarantee the data is unbiased. Selection bias might introduce unexpected correlations between features, and ignoring those unexpected correlations will hurt the performance of clustering algorithms. Therefore, how to remove those unexpected correlations induced by selection bias is extremely important yet largely unexplored for clustering. In this paper, we propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias. Specifically, the decorrelation regularizer aims to learn global sample weights which are capable of balancing the sample distribution, so as to remove unexpected correlations among features. Meanwhile, the learned weights are combined with K-means, which makes the reweighted K-means cluster on the inherent data distribution without unexpected correlation influence. Moreover, we derive the updating rules to effectively infer the parameters in DCKM. Extensive experimental results on real-world datasets well demonstrate that our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias when clustering.
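The abstract's reweighted K-means step can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the authors' DCKM implementation: in DCKM the sample weights are learned jointly by the decorrelation regularizer, whereas here `weighted_kmeans` simply takes a weight vector `w` as given, and all names are hypothetical.

```python
import numpy as np

def weighted_kmeans(X, w, k, n_iters=100, seed=0):
    """Sample-weighted Lloyd iteration: each centroid is the
    weight-weighted mean of its assigned points. In DCKM the
    weights w would come from the decorrelation regularizer;
    here they are just an input."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign every point to its nearest centroid
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():
                # points with larger balancing weights pull harder
                new_centers[j] = np.average(X[mask], axis=0, weights=w[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

With uniform weights this reduces to standard K-means; non-uniform weights shift the centroids toward the reweighted (balanced) distribution, which is the mechanism the abstract describes.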
