论文标题

基于抽样的分布式环境中不同值数量的估计

Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

论文作者

Li, Jiajun, Wei, Zhewei, Ding, Bolin, Dai, Xiening, Lu, Lu, Zhou, Jingren

论文摘要

在数据挖掘中,估计不同值(NDV)的数量是各种应用程序的基本问题。现有的估算NDV的方法可以将其广泛分为两类:i)基于扫描的方法,该方法扫描整个数据并保持草图以近似NDV; ii)基于采样的方法,该方法使用采样数据估算NDV,而不是访问整个数据仓库。基于扫描的方法以较高的I/O和更多时间的成本达到较低的近似误差。在数据量较大的应用程序和由于其较高的可扩展性而具有允许的误差限制的应用中,基于采样的估计是可取的。但是,尽管基于采样的方法在单个机器上更有效,但在具有大量数据量的分布式环境中,它的实用性较小。为了获得最终的NDV估计器,必须在整个分布式系统中转移整个样品,在样本率显着时会产生过于良好的通信成本。本文提出了一种基于素描的新型分布式方法,该方法可在轻度假设下实现基于分布式采样的NDV估计的亚线性通信成本。我们的方法利用基于草图的算法来估计{\ em分布式流模型}中样品的{\ em频率},该算法与大多数基于经典采样的NDV估计器兼容。此外,我们还提供理论证据,证明我们方法在最坏情况下可以最大程度地降低沟通成本的能力。广泛的实验表明,与现有的采样和基于草图的方法相比,我们的方法节省了通信成本的数量级。

In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV can be broadly classified into two categories: i) scanning-based methods, which scan the entire data and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV using sampling data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation is preferable in applications with a large data volume and a permissible error restriction due to its higher scalability. However, while the sampling-based method is more effective on a single machine, it is less practical in a distributed environment with massive data volumes. For obtaining the final NDV estimators, the entire sample must be transferred throughout the distributed system, incurring a prohibitive communication cost when the sample rate is significant. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's {\em frequency of frequency} in the {\em distributed streaming model}, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源