DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Paper title

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Author

Cui, Yiming

Abstract

Video object detection is a fundamental yet challenging task in computer vision. One practical solution is to exploit temporal information in the video and apply feature aggregation to enhance the object features in each frame. Though effective, existing methods suffer from low inference speed because they aggregate features from a fixed number of frames regardless of the input frame. This paper therefore aims to improve the inference speed of current feature-aggregation-based video object detectors while maintaining their performance. To this end, we propose a vanilla dynamic aggregation module that adaptively selects the frames used for feature enhancement. We then extend the vanilla dynamic aggregation module to a more effective and reconfigurable deformable version. Finally, we introduce an in-place distillation loss to improve the representations of objects aggregated with fewer frames. Extensive experimental results validate the effectiveness and efficiency of the proposed methods: on the ImageNet VID benchmark, integrating them into FGFA and SELSA improves inference speed by 31% and 76%, respectively, while achieving comparable accuracy.
