Paper Title


Low-latency Mini-batch GNN Inference on CPU-FPGA Heterogeneous Platform

Authors

Bingyi Zhang, Hanqing Zeng, Viktor Prasanna

Abstract


Mini-batch inference of Graph Neural Networks (GNNs) is a key problem in many real-world applications. Recently, a GNN design principle of model depth-receptive field decoupling has been proposed to address the well-known issue of neighborhood explosion. Decoupled GNN models achieve higher accuracy than the original models and demonstrate excellent scalability for mini-batch inference. We map decoupled GNNs onto CPU-FPGA heterogeneous platforms to achieve low-latency mini-batch inference. On the FPGA platform, we design a novel GNN hardware accelerator with an adaptive datapath, denoted Adaptive Computation Kernel (ACK), that can execute the various computation kernels of GNNs with low latency: (1) for dense computation kernels expressed as matrix multiplication, ACK works as a systolic array with fully localized connections; (2) for sparse computation kernels, ACK follows the scatter-gather paradigm and works as multiple parallel pipelines to support the irregular connectivity of graphs. The proposed task scheduling hides the CPU-FPGA data communication overhead to reduce the inference latency. We develop a fast design space exploration algorithm to generate a single accelerator for multiple target GNN models. We implement our accelerator on a state-of-the-art CPU-FPGA platform and evaluate its performance using three representative models (GCN, GraphSAGE, and GAT). Results show that our CPU-FPGA implementation achieves $21.4-50.8\times$, $2.9-21.6\times$, and $4.7\times$ latency reduction compared with state-of-the-art implementations on CPU-only, CPU-GPU, and CPU-FPGA platforms, respectively.
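
To make the two kernel types mentioned in the abstract concrete, the sketch below splits one mean-aggregation GNN layer into a sparse scatter-gather kernel (neighbor aggregation over an edge list) and a dense kernel (feature transformation as a matrix multiplication). This is a minimal NumPy illustration under our own assumptions, not the authors' accelerator code; the function name `gnn_layer` and all variable names are hypothetical.

```python
import numpy as np

def gnn_layer(x, edge_index, weight):
    """One mean-aggregation GNN layer split into the two kernel types
    distinguished in the abstract (illustrative sketch only).

    x          : (N, F_in)    node feature matrix
    edge_index : (2, E)       COO edge list (source row, destination row)
    weight     : (F_in, F_out) layer weight matrix
    """
    src, dst = edge_index

    # Sparse kernel: neighbor aggregation in the scatter-gather style.
    # Each edge scatters its source node's features into the destination
    # node's accumulator; on the FPGA this kind of irregular computation
    # is what ACK executes as multiple parallel pipelines.
    agg = np.zeros_like(x)
    np.add.at(agg, dst, x[src])                       # gather + accumulate
    deg = np.bincount(dst, minlength=x.shape[0]).clip(min=1)
    agg = agg / deg[:, None]                          # mean aggregation

    # Dense kernel: feature transformation as a matrix multiplication;
    # on the FPGA this kind of regular computation is what ACK executes
    # as a systolic array.
    return np.maximum(agg @ weight, 0.0)              # linear + ReLU


# Tiny usage example on a 3-node path graph (hypothetical data).
x = np.random.rand(3, 4).astype(np.float32)
edge_index = np.array([[0, 1, 1, 2],                  # source nodes
                       [1, 0, 2, 1]])                 # destination nodes
w = np.random.rand(4, 8).astype(np.float32)
out = gnn_layer(x, edge_index, w)                     # shape (3, 8)
```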
