Paper Title

Hierarchical Qualitative Clustering: clustering mixed datasets with critical qualitative information

Authors

Diogo Seca, João Mendes-Moreira, Tiago Mendes-Neves, Ricardo Sousa

Abstract

Clustering can be used to extract insights from data or to verify assumptions held by domain experts, namely data segmentation. In the literature, few methods can cluster qualitative values using the context provided by the other variables in the data without losing interpretability. Moreover, the metrics for calculating dissimilarity between qualitative values often scale poorly to high-dimensional mixed datasets. In this study, we propose Hierarchical Qualitative Clustering (HQC), a novel method for clustering qualitative values that is based on hierarchical clustering and uses Maximum Mean Discrepancy. HQC maintains the original interpretability of the qualitative information present in the dataset. We apply HQC to two datasets. Using a mixed dataset provided by Spotify, we showcase how our method can be used to cluster music artists based on the quantitative features of thousands of songs. In addition, using financial features of companies, we cluster company industries and discuss the implications for investment portfolio diversification.
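
The idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the RBF kernel, the biased MMD estimator, the average-linkage merging rule, and all names (`rbf`, `mmd2`, `cluster_qualitative`, the toy `groups` data) are assumptions chosen for illustration. Each qualitative value (for example, an artist) is represented by the set of quantitative feature vectors observed with it, pairwise dissimilarity is the squared Maximum Mean Discrepancy between those sets, and the values are then merged agglomeratively.

```python
import math
from itertools import combinations


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy


def cluster_qualitative(groups, n_clusters, gamma=1.0):
    """Merge qualitative values agglomeratively using average linkage on MMD^2.

    `groups` maps each qualitative value to a list of numeric feature vectors.
    Returns a list of frozensets of qualitative values, so every cluster stays
    readable in terms of the original labels.
    """
    # Pairwise dissimilarity between qualitative values.
    dist = {
        frozenset((a, b)): mmd2(groups[a], groups[b], gamma)
        for a, b in combinations(groups, 2)
    }
    clusters = [frozenset([name]) for name in groups]

    def linkage(c1, c2):
        # Average pairwise dissimilarity between members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Toy data: hypothetical artists, each with a few two-dimensional song features.
groups = {
    "artist_a": [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15)],
    "artist_b": [(0.12, 0.18), (0.22, 0.12)],
    "artist_c": [(0.90, 0.80), (0.85, 0.95), (0.80, 0.90)],
    "artist_d": [(0.88, 0.82), (0.90, 0.90)],
}
result = cluster_qualitative(groups, n_clusters=2)
print(sorted(sorted(c) for c in result))
```

Because the clusters are sets of the original labels, the output remains directly interpretable. The pairwise kernel sums grow quadratically with the number of samples per value, so a practical implementation would likely need subsampling or a faster MMD approximation.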
