Extended Missing Data Imputation via GANs for Ranking Applications

Paper title

Extended Missing Data Imputation via GANs for Ranking Applications

Authors

Deng, Grace; Han, Cuize; Matteson, David S.

Abstract

We propose Conditional Imputation GAN, an extended missing data imputation method based on Generative Adversarial Networks (GANs). The motivating use case is learning-to-rank, the cornerstone of modern search, recommendation system, and information retrieval applications. Empirical ranking datasets do not always follow standard Gaussian distributions or a Missing Completely At Random (MCAR) mechanism, which are standard assumptions of classic missing data imputation methods. Our methodology provides a simple solution that offers compatible imputation guarantees while relaxing assumptions on the missing mechanism, and it sidesteps approximating intractable distributions to improve imputation quality. We prove that the optimal GAN imputation is achieved for Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR) mechanisms, beyond the naive MCAR. Our method demonstrates the highest imputation quality on the open-source Microsoft Research Ranking (MSR) Dataset and a synthetic ranking dataset compared to state-of-the-art benchmarks and across various feature distributions. Using a proprietary Amazon Search ranking dataset, we also demonstrate comparable ranking quality metrics for ranking models trained on GAN-imputed data compared to ground-truth data.
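To make the MCAR assumption mentioned in the abstract concrete, the following is a minimal sketch, not the paper's method, that contrasts an MCAR mask with a mask whose missingness depends on an always-observed feature (ordinary MAR; the paper's EMAR and EAMAR definitions are not reproduced here). All names (`mcar_mask`, `dependent_mask`, `observed_mean`) and all numeric settings are hypothetical choices for illustration.

```python
import random


def mcar_mask(n_rows, p_missing, rng):
    """True marks a missing value; every row has the same missing probability."""
    return [rng.random() < p_missing for _ in range(n_rows)]


def dependent_mask(observed, threshold, p_low, p_high, rng):
    """Missing probability depends on an always-observed feature (assumed MAR setup)."""
    return [rng.random() < (p_high if x > threshold else p_low) for x in observed]


def observed_mean(values, mask):
    """Mean over the entries that are not masked out."""
    kept = [v for v, missing in zip(values, mask) if not missing]
    return sum(kept) / len(kept)


rng = random.Random(0)
n = 10_000

# x_obs is always observed; y is correlated with x_obs and is the feature that goes missing.
x_obs = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [x + rng.gauss(0.0, 0.5) for x in x_obs]

full_mean = sum(y) / n
mcar_mean = observed_mean(y, mcar_mask(n, 0.3, rng))
mar_mean = observed_mean(y, dependent_mask(x_obs, 0.0, 0.05, 0.8, rng))

print(f"full mean:            {full_mean:.3f}")
print(f"observed mean (MCAR): {mcar_mean:.3f}")
print(f"observed mean (MAR):  {mar_mean:.3f}")
```

Under the MCAR mask the observed mean stays close to the full-data mean, while under the dependent mask it is pulled downward because rows with large `x_obs` are dropped more often. This is the kind of bias that simple imputation under an MCAR assumption would not correct.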
