Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Paper Title

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Authors

Tarique Siddiqui, Alekh Jindal, Shi Qiao, Hiren Patel, Wangchao Le

Abstract

Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model, therefore, is akin to better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer. To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to the Cascades framework to pick optimal resources, i.e., number of containers, during query planning. And third, we integrate the learned cost models within the Cascades-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate, and 20X more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage.
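
The abstract does not spell out how the individual models are combined or how the container count is chosen, so the following is only a minimal sketch of the general idea, not Cleo's actual method. It assumes a specialized model per operator signature that is used once enough samples exist, a pooled general model as the fallback for coverage, and a made-up latency formula for picking containers. All names (`CombinedCostModel`, `pick_containers`, `per_container_overhead_s`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


class CombinedCostModel:
    """Specialized per-signature models with a general fallback (illustrative only)."""

    def __init__(self, min_samples=3):
        self.min_samples = min_samples
        # signature -> observed runtime per input row (seconds/row)
        self.by_signature = defaultdict(list)
        # pooled observations used by the general fallback model
        self.all_rates = []

    def observe(self, signature, rows, runtime_s):
        """Record one past execution of an operator with the given signature."""
        rate = runtime_s / rows
        self.by_signature[signature].append(rate)
        self.all_rates.append(rate)

    def predict(self, signature, rows):
        """Return (estimated runtime in seconds, which model was used)."""
        rates = self.by_signature.get(signature, [])
        if len(rates) >= self.min_samples:
            # Enough history for this signature: use the accurate, narrow model.
            return mean(rates) * rows, "specialized"
        # Otherwise fall back to the pooled model so every operator gets a cost.
        return mean(self.all_rates) * rows, "general"


def pick_containers(serial_runtime_s, candidates, per_container_overhead_s=0.5):
    """Pick the candidate container count with the lowest estimated latency.

    Assumption: work divides evenly across containers and each container
    adds a fixed scheduling overhead. This formula is invented for the sketch.
    """
    def latency(n):
        return serial_runtime_s / n + per_container_overhead_s * n

    return min(candidates, key=latency)
```

A short usage example under the same assumptions: after calling `observe` on a few past runs, `predict("filter", 1000)` uses the specialized model if the `"filter"` signature has at least `min_samples` observations and the pooled model otherwise, and `pick_containers(100.0, [1, 2, 4, 8, 16, 32])` evaluates the assumed latency formula for each candidate and returns the cheapest one.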
