AI Giving Back to Statistics? Discovery of the Coordinate System of Univariate Distributions by Beta Variational Autoencoder

Paper title

AI Giving Back to Statistics? Discovery of the Coordinate System of Univariate Distributions by Beta Variational Autoencoder

Paper author

Glushkovsky, Alex

Abstract

Distributions are fundamental statistical elements that play essential theoretical and practical roles. The article discusses experiences of training neural networks to classify univariate empirical distributions and to represent them in a two-dimensional latent space that forces disentanglement, using cumulative distribution functions (CDFs) as inputs. The latent space representation is produced by an unsupervised beta variational autoencoder (beta-VAE). It separates distributions of different shapes while overlapping similar ones, and it empirically recovers relationships between distributions that are known theoretically. A synthetic experiment with generated univariate continuous and discrete (Bernoulli) distributions of varying sample sizes and parameters has been performed to support the study. The representation in the two-dimensional latent coordinate system can be seen as additional metadata for real-world data that disentangles important distribution characteristics, such as the shape of the CDF, the classification probabilities of underlying theoretical distributions and their parameters, information entropy, and skewness. Entropy changes, which provide an "arrow of time", determine dynamic trajectories along the representations of distributions in the latent space. In addition, a post-beta-VAE unsupervised segmentation of the latent space, based on the weight of evidence (WOE) of posterior versus standard isotropic two-dimensional normal densities, has been applied to detect the presence of assignable causes that distinguish exceptional CDF inputs.
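The three ingredients named in the abstract (CDF inputs, a beta-VAE objective, and a WOE score against the standard normal prior) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the grid size, the min-max grid used for the CDF, the squared-error reconstruction term, the value `beta=4.0`, and the per-point log-density-ratio form of `woe` are all hypothetical choices.

```python
import numpy as np

def empirical_cdf(sample, grid_size=50):
    """Empirical CDF of a sample on an evenly spaced grid spanning
    the sample's own min..max (an assumed normalisation)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    grid = np.linspace(sample[0], sample[-1], grid_size)
    # Fraction of observations less than or equal to each grid point.
    return np.searchsorted(sample, grid, side="right") / sample.size

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Squared-error reconstruction plus beta-weighted KL divergence
    between N(mu, diag(exp(log_var))) and the standard normal prior."""
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + beta * kl

def woe(z, mu, log_var):
    """Log-density ratio log q(z) - log p(z) of a diagonal-Gaussian
    posterior q versus the standard isotropic normal p.
    This form of WOE is an assumption, not the paper's definition."""
    var = np.exp(log_var)
    log_q = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)
    log_p = -0.5 * np.sum(np.log(2 * np.pi) + z ** 2)
    return log_q - log_p
```

In this sketch, a fixed-length CDF vector from `empirical_cdf` would be the encoder input, `mu` and `log_var` would be the encoder's two-dimensional outputs, and a large positive `woe` would flag a latent point that the posterior favours much more strongly than the prior does.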
