Title

GPU Load Balancing

Author

Osama, Muhammad

Abstract

Fine-grained workload and resource balancing is the key to high performance for regular and irregular computations on GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques to build an abstraction that addresses the difficulty of scheduling computations on the GPU. We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior to our work, the only way to unleash the GPU's potential on irregular problems has been to workload-balance through application-specific, tightly coupled load-balancing techniques. With our open-source framework for load balancing, we hope to improve programmers' productivity when developing irregular-parallel algorithms on the GPU, and also improve the overall performance characteristics of such applications by allowing a quick path to experimentation with a variety of existing load-balancing techniques. Using our insights from load-balancing irregular workloads, we build Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner-loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements. On GPU processors, our Stream-K parallelization of GEMM produces a peak speedup of up to 14x and 6.7x, and an average performance response that is both higher and more consistent across 32K GEMM problem geometries than state-of-the-art math libraries such as CUTLASS and cuBLAS.
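
The work-centric decomposition sketched in the abstract can be made concrete with a short CUDA kernel. The code below is only a minimal illustrative sketch of the idea of splitting the aggregate MAC-loop iteration space evenly among processing elements; it is not the dissertation's framework or the CUTLASS Stream-K implementation, and all names (stream_k_gemm_sketch, num_tiles, iters_per_tile) are hypothetical.

```cuda
// Minimal sketch of a Stream-K-style work decomposition (illustrative only).
// One CTA stands in for one "processing element"; the kernel only shows how
// the flat iteration space is partitioned, not the actual GEMM math.
__global__ void stream_k_gemm_sketch(int num_tiles, int iters_per_tile)
{
    int total_iters = num_tiles * iters_per_tile;   // aggregate inner-loop (MAC) iterations
    int pe          = blockIdx.x;                   // this processing element
    int num_pes     = gridDim.x;                    // total processing elements

    // Even share of the aggregate iteration space for this processing element.
    int iters_per_pe = (total_iters + num_pes - 1) / num_pes;
    int iter_begin   = min(pe * iters_per_pe, total_iters);
    int iter_end     = min(iter_begin + iters_per_pe, total_iters);

    int iter = iter_begin;
    while (iter < iter_end)
    {
        int tile_idx  = iter / iters_per_tile;      // output tile this iteration falls in
        int tile_iter = iter % iters_per_tile;      // position within that tile's K loop
        int tile_stop = min((tile_idx + 1) * iters_per_tile, iter_end);

        // Here a real kernel would accumulate the K-loop slice
        // [tile_iter, tile_iter + (tile_stop - iter)) of tile_idx.
        // A processing element that covers only part of a tile produces a
        // partial sum that is later combined with the other contributors
        // (e.g., through a fix-up/reduction pass).
        (void)tile_iter;

        iter = tile_stop;                           // move on, possibly into the next tile
    }
}
```

Because the share boundaries ignore tile boundaries, a processing element may finish the tail of one output tile and start the head of the next, which is what keeps utilization near-perfect even when the number of output tiles does not divide evenly across the hardware.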
