Paper Title
Multi-Camera Calibration Free BEV Representation for 3D Object Detection
Paper Authors
Paper Abstract
In advanced paradigms of autonomous driving, learning a Bird's Eye View (BEV) representation from surrounding views is crucial for multi-task frameworks. However, existing methods based on depth estimation or camera-driven attention cannot obtain a stable transformation under noisy camera parameters, facing two main challenges: accurate depth prediction and calibration. In this work, we present a completely Multi-Camera Calibration Free Transformer (CFT) for robust BEV representation, which focuses on exploring an implicit mapping rather than relying on camera intrinsics and extrinsics. To guide better feature learning from image views to BEV, CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA). Instead of camera-driven point-wise or global transformation, we propose a view-aware attention that interacts within a more effective region at lower computation cost, reducing redundant computation and promoting convergence. CFT achieves 49.7% NDS on the nuScenes detection task leaderboard; it is the first work to remove camera parameters while remaining comparable to other geometry-guided methods. Without temporal input or other modal information, CFT achieves the second-highest performance with a smaller image input of 1600×640. Thanks to the view-aware attention variant, CFT reduces memory and transformer FLOPs relative to vanilla attention by about 12% and 60%, respectively, while improving NDS by 1.0%. Moreover, its natural robustness to noisy camera parameters makes CFT more competitive.
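The view-aware attention described above can be pictured as restricting each BEV query to the image tokens of a subset of camera views, rather than attending globally across all cameras. Below is a minimal, hypothetical PyTorch sketch of that idea; the function name, tensor shapes, and masking scheme are illustrative assumptions, not the paper's actual implementation, and a dense mask is used for clarity (realizing the stated FLOP savings would require gathering only the allowed tokens).

```python
# Hypothetical sketch of view-aware attention: each BEV query attends only to
# tokens from the camera views it is assigned to, instead of all views.
# All names and shapes are illustrative assumptions, not from the paper.
import torch

def view_aware_attention(bev_queries, img_tokens, view_ids, allowed_views):
    """
    bev_queries:   (B, Q, C)  BEV query embeddings
    img_tokens:    (B, T, C)  flattened multi-camera image features
    view_ids:      (T,)       camera index of each image token
    allowed_views: (Q, V)     bool; True if query q may attend to view v
    """
    B, Q, C = bev_queries.shape
    # Scaled dot-product scores between every query and every image token.
    scores = torch.einsum("bqc,btc->bqt", bev_queries, img_tokens) / C ** 0.5
    # Mask out tokens belonging to views a query is not assigned to.
    mask = allowed_views[:, view_ids]                      # (Q, T)
    scores = scores.masked_fill(~mask.unsqueeze(0), float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bqt,btc->bqc", attn, img_tokens)

# Toy usage: 2 queries, 6 camera views, 4 tokens per view.
B, V, P, C = 1, 6, 4, 32
tokens = torch.randn(B, V * P, C)
view_ids = torch.arange(V).repeat_interleave(P)            # (24,)
queries = torch.randn(B, 2, C)
allowed = torch.zeros(2, V, dtype=torch.bool)
allowed[0, :2] = True   # query 0 only sees views 0-1
allowed[1, 3:] = True   # query 1 only sees views 3-5
out = view_aware_attention(queries, tokens, view_ids, allowed)
```

Because each query's softmax runs over a strict subset of tokens, a sparse implementation of this pattern attends to fewer key-value pairs than vanilla global attention, which is consistent with the memory and FLOP reductions the abstract reports.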