Paper Title
Evolutionary Bin Packing for Memory-Efficient Dataflow Inference Acceleration on FPGA
Paper Authors
Paper Abstract
Convolutional neural network (CNN) dataflow inference accelerators implemented in Field Programmable Gate Arrays (FPGAs) have demonstrated increased energy efficiency and lower latency compared to CNN execution on CPUs or GPUs. However, the complex shapes of CNN parameter memories typically do not map well to FPGA on-chip memory (OCM), which results in poor OCM utilization and ultimately limits the size and types of CNNs that can be effectively accelerated on FPGAs. In this work, we present a design methodology that improves the mapping efficiency of CNN parameters to FPGA OCM. We frame the mapping as a bin packing problem and determine that traditional bin packing algorithms are not well suited to solving it under FPGA- and CNN-specific constraints. We hybridize genetic algorithms and simulated annealing with traditional bin packing heuristics to create flexible mappers capable of grouping parameter memories such that each group optimally fits the FPGA's on-chip memories. We evaluate these algorithms on a variety of FPGA inference accelerators. Our hybrid mappers converge to optimal solutions in seconds for all CNN use cases, achieve up to a 65% increase in OCM utilization efficiency for deep CNNs, and are up to 200$\times$ faster than current state-of-the-art simulated annealing approaches.
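
The abstract describes hybridizing evolutionary search with classic bin packing heuristics. As a rough illustration of that general idea only (a minimal sketch, not the authors' actual method), the Python code below runs a genetic algorithm over orderings of parameter memories and decodes each chromosome with a first-fit heuristic into fixed-capacity OCM bins. The memory depths, bin capacity, and GA hyperparameters are all hypothetical.

```python
# Illustrative sketch (not the paper's implementation): a genetic algorithm
# whose chromosomes are permutations of CNN parameter memories, decoded by a
# first-fit heuristic into fixed-capacity on-chip memory (OCM) bins.
# All sizes and hyperparameters below are hypothetical.
import random

BRAM_WORDS = 1024  # hypothetical capacity of one OCM bin, in words
MEMORIES = [700, 520, 480, 300, 260, 200, 130, 90]  # hypothetical memory depths

def first_fit(order):
    """Decode a permutation: place each memory in the first bin it fits."""
    bins = []
    for size in order:
        for b in bins:
            if sum(b) + size <= BRAM_WORDS:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

def fitness(order):
    """Fewer bins means higher OCM utilization, so minimize the bin count."""
    return len(first_fit(order))

def crossover(a, b):
    """Order crossover: keep a slice of parent a, fill the rest from b."""
    i, j = sorted(random.sample(range(len(a)), 2))
    child = a[i:j]
    child += [g for g in b if g not in child]  # valid here: sizes are distinct
    return child

def mutate(order, rate=0.2):
    """Swap two genes with a small probability."""
    order = order[:]
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def evolve(pop_size=30, generations=100):
    """Elitist GA loop: keep the best half, refill with offspring."""
    pop = [random.sample(MEMORIES, len(MEMORIES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 2]
        pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                       for _ in range(pop_size - len(elite))]
    return first_fit(min(pop, key=fitness))

if __name__ == "__main__":
    packing = evolve()
    print(f"{len(packing)} bins: {packing}")
```

The permutation-plus-decoder encoding is a common way to hybridize evolutionary search with packing heuristics: the heuristic guarantees every chromosome decodes to a feasible packing, while the search explores orderings the plain heuristic would never try.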