Paper Title

Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints

Authors

Xi Liang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, Michael J. Franklin

Abstract

Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples. We describe how to process SUM, COUNT, AVG, MIN, and MAX queries in these conditions resulting in hard error bounds with testable constraints. We propose an optimization algorithm based on an integer program that reconciles a set of such constraints, even if they are overlapping, conflicting, or unsatisfiable, into such bounds. Our experiments on real-world datasets against several statistical imputation and inference baselines show that statistical techniques can have a deceptively high error rate that is often unpredictable. In contrast, our framework offers hard bounds that are guaranteed to hold if the constraints are not violated. In spite of these hard bounds, we show competitive accuracy to statistical baselines.
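
To make the idea of a "range of values an aggregate query could take" concrete, here is a minimal sketch for SUM and COUNT only. It is not the integer-program algorithm from the paper: it assumes the constraints cover disjoint groups of missing rows, so each one can be bounded independently. The `Constraint` class and its field names are hypothetical, introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint on one disjoint group of missing rows."""
    lo: float       # smallest value a missing row in this group can take
    hi: float       # largest value a missing row in this group can take
    min_rows: int   # fewest rows that can be missing from this group
    max_rows: int   # most rows that can be missing from this group


def sum_bounds(observed_sum, constraints):
    """Hard bounds on SUM over observed plus missing rows.

    Assumes the constraints do not overlap, so the extreme contribution
    of each group is reached at one end of its row-count range.
    """
    lower = upper = observed_sum
    for c in constraints:
        lower += min(c.min_rows * c.lo, c.max_rows * c.lo)
        upper += max(c.min_rows * c.hi, c.max_rows * c.hi)
    return lower, upper


def count_bounds(observed_count, constraints):
    """Hard bounds on COUNT over observed plus missing rows."""
    lower = observed_count + sum(c.min_rows for c in constraints)
    upper = observed_count + sum(c.max_rows for c in constraints)
    return lower, upper


if __name__ == "__main__":
    groups = [
        Constraint(lo=0, hi=10, min_rows=0, max_rows=5),
        Constraint(lo=-2, hi=3, min_rows=1, max_rows=4),
    ]
    print(sum_bounds(100, groups))
    print(count_bounds(20, groups))
```

The bounds hold only as long as the stated constraints are not violated, which mirrors the guarantee described in the abstract. Overlapping, conflicting, or unsatisfiable constraints need the reconciliation step that the paper handles with an integer program, and this sketch does not attempt it.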
