Paper Title

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Authors

Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

Abstract

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rely on the assumption that a job is recurring to learn from previous runs or to warrant the cost of full test runs to learn from. However, this assumption often does not hold since many jobs are too unique. Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration with enough total memory. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of job execution costs by 56% compared to the baseline, while on average spending less than ten minutes on profiling runs per job on a consumer-grade laptop.
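The core procedure described in the abstract, extrapolating peak memory from small-sample profiling runs and then picking a cluster configuration with enough total memory, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple linear fit of peak memory against input size, and the profiling measurements and instance catalog below are hypothetical.

```python
# Minimal sketch of the profile-extrapolate-select idea behind Crispy.
# Assumptions (not from the paper): a linear memory model, hypothetical
# profiling measurements, and a made-up cluster configuration catalog.
import numpy as np

def extrapolate_memory(sample_sizes_gb, peak_mem_gb, full_size_gb):
    """Fit peak memory vs. input size from small-sample profiling runs
    and extrapolate to the full dataset size."""
    slope, intercept = np.polyfit(sample_sizes_gb, peak_mem_gb, deg=1)
    return slope * full_size_gb + intercept

def choose_config(configs, required_mem_gb):
    """Pick the cheapest configuration whose total memory covers the
    predicted requirement. Each config is a tuple:
    (name, node_count, mem_per_node_gb, price_per_hour)."""
    feasible = [c for c in configs if c[1] * c[2] >= required_mem_gb]
    return min(feasible, key=lambda c: c[3]) if feasible else None

if __name__ == "__main__":
    # Hypothetical single-machine profiling runs on 1, 2, and 4 GB samples.
    samples = np.array([1.0, 2.0, 4.0])   # sample input sizes in GB
    peaks = np.array([2.1, 3.9, 7.8])     # observed peak memory in GB
    need = extrapolate_memory(samples, peaks, full_size_gb=100.0)

    catalog = [
        ("m5.xlarge x4", 4, 16, 0.768),
        ("r5.xlarge x4", 4, 32, 1.008),
        ("r5.2xlarge x4", 4, 64, 2.016),
    ]
    print(f"predicted peak memory: {need:.0f} GB ->",
          choose_config(catalog, need))
```

In this sketch the predicted requirement (~196 GB) rules out the two smaller configurations, so the 4x64 GB option is chosen as the cheapest feasible one; the actual method profiles real jobs on dataset samples rather than fitting synthetic numbers.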
