Paper Title
Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection
Authors
Abstract
Recent advances in 3D object detection rely heavily on how the 3D data are represented, i.e., voxel-based or point-based representation. Many existing high-performance 3D detectors are point-based because this structure better retains precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. In contrast, the voxel-based structure is better suited for feature extraction but often yields lower accuracy because the input data are divided into grids. In this paper, we take a slightly different viewpoint: we find that precise positioning of raw points is not essential for high-performance 3D object detection and that coarse voxel granularity can also offer sufficient detection accuracy. Bearing this view in mind, we devise a simple but effective voxel-based framework, named Voxel R-CNN. By taking full advantage of voxel features in a two-stage approach, our method achieves detection accuracy comparable to state-of-the-art point-based models, but at a fraction of the computation cost. Voxel R-CNN consists of a 3D backbone network, a 2D bird's-eye-view (BEV) Region Proposal Network, and a detection head. A voxel RoI pooling operation is devised to extract RoI features directly from voxel features for further refinement. Extensive experiments are conducted on the widely used KITTI Dataset and the more recent Waymo Open Dataset. Our results show that, compared to existing voxel-based methods, Voxel R-CNN delivers higher detection accuracy while maintaining a real-time frame processing rate, i.e., a speed of 25 FPS on an NVIDIA RTX 2080 Ti GPU. The code is available at https://github.com/djiajunustc/Voxel-R-CNN.
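The abstract's point that voxel-based input is "divided into grids" can be illustrated with a minimal sketch of voxelization. This is not the authors' implementation: the function name `voxelize`, the mean-pooling of points within a voxel, and the grid parameters are all assumptions chosen for illustration. It shows how several raw points collapse into a single per-voxel feature, which is where the precise point positions are lost.

```python
import numpy as np

def voxelize(points, voxel_size, range_min):
    """Group points into voxels and average the points inside each voxel.

    Hypothetical helper for illustration only (not from the Voxel R-CNN code).
    points:     (N, 3) array of x, y, z coordinates
    voxel_size: (3,) edge lengths of one voxel
    range_min:  (3,) lower corner of the grid
    Returns (coords, features): integer indices of the occupied voxels and
    the mean point position inside each occupied voxel.
    """
    points = np.asarray(points, dtype=np.float64)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    range_min = np.asarray(range_min, dtype=np.float64)

    # Integer grid index of every point along x, y, z.
    idx = np.floor((points - range_min) / voxel_size).astype(np.int64)

    # Unique occupied voxels, plus the voxel each point belongs to.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; the shape varies across NumPy versions

    # Sum the points per voxel, then divide by the per-voxel point count.
    sums = np.zeros((len(coords), 3))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=len(coords)).reshape(-1, 1)
    return coords, sums / counts

# Two nearby points fall into the same 0.5 m voxel; the third lands elsewhere.
pts = [[0.1, 0.1, 0.1], [0.3, 0.3, 0.3], [1.2, 0.1, 0.1]]
coords, feats = voxelize(pts, voxel_size=[0.5, 0.5, 0.5], range_min=[0.0, 0.0, 0.0])
```

Here the first two points are merged into one voxel whose feature is their mean, so their individual positions are no longer recoverable. A larger `voxel_size` merges more points and makes the representation coarser.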