Paper Title

Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design

Paper Authors

Hongxiang Fan, Thomas Chau, Stylianos I. Venieris, Royson Lee, Alexandros Kouris, Wayne Luk, Nicholas D. Lane, Mohamed S. Abdelfattah

Abstract

Attention-based neural networks have become pervasive in many AI tasks. Despite their excellent algorithmic performance, the use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources, which often compromises their hardware performance. Although various sparse variants have been introduced, most approaches only focus on mitigating the quadratic scaling of attention on the algorithm level, without explicitly considering the efficiency of mapping their methods on real hardware designs. Furthermore, most efforts only focus on either the attention mechanism or the FFNs but without jointly optimizing both parts, causing most of the current designs to lack scalability when dealing with different input lengths. This paper systematically considers the sparsity patterns in different variants from a hardware perspective. On the algorithmic level, we propose FABNet, a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs. On the hardware level, a novel adaptable butterfly accelerator is proposed that can be configured at runtime via dedicated hardware control to accelerate different butterfly layers using a single unified hardware engine. On the Long-Range-Arena dataset, FABNet achieves the same accuracy as the vanilla Transformer while reducing the amount of computation by 10 to 66 times and the number of parameters by 2 to 22 times. By jointly optimizing the algorithm and hardware, our FPGA-based butterfly accelerator achieves 14.2 to 23.2 times speedup over state-of-the-art accelerators normalized to the same computational budget. Compared with optimized CPU and GPU designs on Raspberry Pi 4 and Jetson Nano, our system is up to 273.8 and 15.1 times faster under the same power budget.
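The key building block referenced in the abstract is the butterfly sparsity pattern: instead of a dense n-by-n weight matrix, a layer is expressed as log2(n) sparse factors, each mixing only pairs of positions, so parameters and compute drop from O(n^2) to O(n log n), consistent with the compute and parameter reductions reported above. The sketch below is a minimal NumPy illustration of a generic butterfly-factorized linear transform; the function name, weight layout, and pairing order are illustrative assumptions and not the paper's actual FABNet or accelerator implementation.

```python
import numpy as np

def butterfly_layer(x, factors):
    """Apply a butterfly-factorized linear map to x (shape [..., n]).

    `factors` holds log2(n) levels; level k stores one 2x2 weight block
    per pair of positions that are 2**k apart, so each level has only
    2n nonzeros instead of the n**2 of a dense weight matrix.
    """
    n = x.shape[-1]
    out = x.reshape(-1, n).copy()
    for k, w in enumerate(factors):
        stride = 2 ** k
        new = np.empty_like(out)
        pair = 0
        for start in range(0, n, 2 * stride):
            for j in range(stride):
                a, b = start + j, start + j + stride
                # 2x2 butterfly block mixing positions a and b
                new[:, a] = w[pair, 0, 0] * out[:, a] + w[pair, 0, 1] * out[:, b]
                new[:, b] = w[pair, 1, 0] * out[:, a] + w[pair, 1, 1] * out[:, b]
                pair += 1
        out = new
    return out.reshape(x.shape)

# Toy usage: an 8-wide layer needs log2(8) = 3 levels of 4 blocks each,
# i.e. 3 * 4 * 4 = 48 weights versus 64 for a dense 8x8 matrix.
n = 8
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n // 2, 2, 2)) for _ in range(int(np.log2(n)))]
y = butterfly_layer(rng.standard_normal((4, n)), factors)
print(y.shape)  # (4, 8)
```

Because every level shares the same regular 2x2-block structure, the same compute pattern recurs across attention and FFN layers, which is the property the adaptable accelerator exploits by reusing a single unified hardware engine for different butterfly layers.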
