Paper Title

Lachesis: Automatic Partitioning for UDF-Centric Analytics

Authors

Zou, Jia; Das, Amitabh; Barhate, Pratik; Iyengar, Arun; Yuan, Binhang; Jankov, Dimitrije; Jermaine, Chris

Abstract

Persistent partitioning is effective in avoiding expensive shuffling operations. However, it remains a significant challenge to automate this process for Big Data analytics workloads that make extensive use of user-defined functions (UDFs), where sub-computations are harder to reuse for partitioning than in relational applications. In addition, functional dependencies, which are widely used for partitioning selection, are often unavailable in the unstructured data that is ubiquitous in UDF-centric analytics. We propose the Lachesis system, which represents UDF-centric workloads as workflows of analyzable and reusable sub-computations. Lachesis further adopts a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then applied to automatically optimize the storage of data across applications, improving both performance and user productivity.
