GPU上图形算法的汇编技术

论文标题

GPU上图形算法的汇编技术

Compilation Techniques for Graph Algorithms on GPUs

论文作者

Brahmakshatriya, Ajay, Zhang, Yunming, Hong, Changwan, Kamil, Shoaib, Shun, Julian, Amarasinghe, Saman

论文摘要

图形程序的性能在很大程度上取决于算法，输入图的大小和结构以及基础硬件的功能。在所有设置中，没有一组优化或一个硬件平台可以很好地工作。为了获得高性能，程序员必须仔细选择要使用的一组优化和硬件平台。石墨编程语言使程序员可以轻松编写算法一次，并使用调度语言对不同的输入进行优化。但是，Graphit当前不支持为GPU生成高性能代码。程序员必须求助于以低级语言从头开始重新实现整个算法，并具有完全不同的抽象和优化集，以便在GPU上实现高性能。我们建议使用相同的算法规范在CPU和GPU上同时在CPU和GPU上达到高性能GG。 GG通过新颖的GPU调度语言和编译器可显着扩展GPU图形处理框架的优化空间，该语言和编译器能够结合GPU的图形优化。 GG还引入了两个性能优化：基于边缘的线程WARPS CTAS负载平衡（ETWC）和EdgeBlocking，以扩展GPU的优化空间。 ETWC通过将每个顶点的边缘动态划分为分配给线程，扭曲和CTA的块来改善负载平衡，以执行。通过重新排序边缘并限制随机内存访问以适合L2高速缓存，可以改善程序的局部性。我们在Pascal和Volta Generation Nvidia GPU上评估了5种算法和9个输入图的GG，并表明它在最先进的GPU图形处理框架上实现了高达5.11倍的速度，并且在90个实验中的66个最快。

The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations and hardware platforms to use. The GraphIt programming language makes it easy for the programmer to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high performance code for GPUs. Programmers must resort to re-implementing the entire algorithm from scratch in a low-level language with an entirely different set of abstractions and optimizations in order to achieve high performance on GPUs. We propose GG, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. GG significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining graph optimizations for GPUs. GG also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balancing by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate GG on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11x speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments.

下载PDF全文

下载文献需遵守相关版权规定

论文标题