FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design

Paper title

FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design

Authors

Nobuho Hashimoto, Shinya Takamaeda-Yamazaki

Abstract

3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large number of calculations involved results in low computation speed and high power consumption. Although there are various domain-specific hardware accelerators for DNNs, it is not easy to accelerate the entire process of applications that alternate between traditional image/video processing algorithms and DNNs. Thus, FPGA-based end-to-end acceleration is required for such complicated applications in low-power embedded environments. This paper proposes a novel FPGA-based accelerator for DeepVideoMVS, a DNN-based depth estimation method for 3D reconstruction. We employ HW/SW co-design to appropriately utilize heterogeneous components in modern SoC FPGAs, such as programmable logic (PL) and CPU, according to the inherent characteristics of the method. As some operations are unsuitable for hardware implementation, we determine the operations to be implemented in software through analyzing the number of times each operation is performed and its memory access pattern, and then considering comprehensive aspects: the ease of hardware implementation and degree of expected acceleration by hardware. The hardware and software implementations are executed in parallel on the PL and CPU to hide their execution latencies. The proposed accelerator was developed on a Xilinx ZCU104 board by using NNgen, an open-source high-level synthesis (HLS) tool. Experiments showed that the proposed accelerator operates 60.2 times faster than the software-only implementation on the same FPGA board with minimal accuracy degradation.
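
The latency-hiding idea in the abstract (hardware and software parts executing in parallel on the PL and CPU) can be illustrated with a small sketch. This is not the authors' implementation: the step names, the latency figures, and the helper functions below are hypothetical, chosen only to show why overlapping two independent sides of a step costs roughly the slower side rather than their sum.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-frame steps. Each pairs work assumed to run on the PL
# with independent work assumed to run on the CPU. Latencies (ms) are
# made-up illustration values, not measurements from the paper.
STEPS = [
    {"name": "step_a", "pl_ms": 40.0, "cpu_ms": 25.0},
    {"name": "step_b", "pl_ms": 10.0, "cpu_ms": 30.0},
    {"name": "step_c", "pl_ms": 50.0, "cpu_ms": 5.0},
]


def serial_latency(steps):
    """Latency when the PL and CPU parts of every step run back to back."""
    return sum(s["pl_ms"] + s["cpu_ms"] for s in steps)


def overlapped_latency(steps):
    """Latency when the PL and CPU parts of each step run concurrently,
    so each step costs only as much as its slower side."""
    return sum(max(s["pl_ms"], s["cpu_ms"]) for s in steps)


def run_step(pl_task, cpu_task):
    """Launch both sides of one step concurrently and wait for both results.
    Plain Python callables stand in for the accelerator call and CPU code."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pl_future = pool.submit(pl_task)
        cpu_future = pool.submit(cpu_task)
        return pl_future.result(), cpu_future.result()


if __name__ == "__main__":
    print("serial:", serial_latency(STEPS), "ms")
    print("overlapped:", overlapped_latency(STEPS), "ms")
```

In this toy model the hidden latency is the difference between the two totals. A real design would also have to account for data dependencies between steps and for transfer costs between the PL and the CPU, which the sketch ignores.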
