论文标题

$ \ ell_p $ PCA和光谱群集理论

An $\ell_p$ theory of PCA and spectral clustering

论文作者

Abbe, Emmanuel, Fan, Jianqing, Wang, Kaizheng

论文摘要

主成分分析(PCA)是统计和机器学习中的强大工具。虽然现有的PCA研究集中于主要成分及其相关特征值的恢复,但对单个主要成分得分的精确表征很少,这些分数得出样品的低维嵌入。这阻碍了各种光谱方法的分析。在本文中,我们首先为希尔伯特空间中的PCA挖空版本开发了一个$ \ ell_p $扰动理论,在存在异质噪声的情况下,它可以改善香草pca。通过新颖的$ \ ell_p $对特征向量的分析,我们研究了主成分得分向量的进入行为,并表明它们可以通过$ \ ell_p $ norm的革兰氏矩阵的线性函数近似,其中包括$ \ ell_2 $和$ \ ell_ \ ell_ \ ell_ \ ell_ \ ell_ \ ell_ \ iffty $作为特殊示例。对于次高斯的混合模型,提供最佳界限的$ p $的选择取决于信噪比,这进一步可为光谱群集提供最佳保证。对于上下文社区检测,$ \ ell_p $理论导致了一种简单的光谱算法,该算法实现了精确恢复的信息阈值。这些还为特殊情况提供了高斯混合物和随机块模型的最佳恢复结果。

Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an $\ell_p$ perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel $\ell_p$ analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in $\ell_p$ norm, which includes $\ell_2$ and $\ell_\infty$ as special examples. For sub-Gaussian mixture models, the choice of $p$ giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the $\ell_p$ theory leads to a simple spectral algorithm that achieves the information threshold for exact recovery. These also provide optimal recovery results for Gaussian mixture and stochastic block models as special cases.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源