Paper Title

Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data

Authors

Yubo Zhuang, Xiaohui Chen, Yun Yang

Abstract

Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework to tackle data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, imposing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that maximizes a surrogate function minorizing the log-likelihood of observed data in each iteration, but it suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxed $K$-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroid estimation -- a key feature that allows exact recovery under well-separatedness of centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numerical experiments demonstrate that iLA-SDP achieves lower mis-clustering error than several widely used clustering methods, including $K$-means, SDP and EM algorithms.
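The "lifting" idea mentioned in the abstract replaces discrete cluster labels with a normalized membership matrix $Z = \sum_k \tfrac{1}{n_k}\,\mathbf{1}_{G_k}\mathbf{1}_{G_k}^\top$, so that the $K$-means objective becomes linear in $Z$; the SDP then relaxes the set of such matrices. The sketch below (a hypothetical illustration with toy data, not the paper's code) builds $Z$ for a given assignment and checks the standard identity: the within-cluster sum of squares equals $\mathrm{tr}(XX^\top) - \langle XX^\top, Z\rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs (illustrative only)
X = np.vstack([rng.normal(-3.0, 1.0, (10, 2)),
               rng.normal(3.0, 1.0, (12, 2))])
labels = np.array([0] * 10 + [1] * 12)
K = 2
n = len(labels)

# Lifted membership matrix: Z = sum_k (1/n_k) 1_{G_k} 1_{G_k}^T
Z = np.zeros((n, n))
for k in range(K):
    idx = labels == k
    Z[np.ix_(idx, idx)] = 1.0 / idx.sum()

# Within-cluster sum of squares computed directly from centroids...
wcss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
           for k in range(K))

# ...and via the trace identity that the SDP relaxation builds on
G = X @ X.T
wcss_lifted = np.trace(G) - np.sum(G * Z)

print(abs(wcss - wcss_lifted) < 1e-8)
# Z satisfies the SDP constraints: Z >= 0 entrywise, Z 1 = 1, tr(Z) = K,
# and Z is positive semidefinite by construction.
```

The SDP relaxation keeps these linear and conic constraints on $Z$ but drops the combinatorial requirement that $Z$ come from a hard assignment; the abstract's point is that the cluster structure is recovered from the optimal $Z$ directly, with no centroid estimation step.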
