Paper Title
Dynamic Sampling Networks for Efficient Action Recognition in Videos
Paper Authors
Paper Abstract
Existing action recognition methods are mainly based on clip-level classifiers such as two-stream CNNs or 3D CNNs, which are trained on randomly selected clips and applied to densely sampled clips during testing. However, this standard setting might be suboptimal for training the classifiers and also incurs a huge computational overhead when deployed in practice. To address these issues, we propose a new framework for action recognition in videos, called {\em Dynamic Sampling Networks} (DSN), which introduces a dynamic sampling module to improve the discriminative power of the learned clip-level classifiers as well as to increase inference efficiency during testing. Specifically, DSN is composed of a sampling module and a classification module, whose objectives are, respectively, to learn a sampling policy that selects on-the-fly which clips to keep, and to train a clip-level classifier that performs action recognition based on the selected clips. In particular, given an input video, we train an observation network in an associative reinforcement learning setting to maximize the reward of selected clips that yield a correct prediction. We perform extensive experiments to study different aspects of the DSN framework on four action recognition datasets: UCF101, HMDB51, THUMOS14, and ActivityNet v1.3. The experimental results demonstrate that DSN greatly improves inference efficiency by using less than half of the clips, while still obtaining recognition accuracy slightly better than or comparable to state-of-the-art approaches.
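To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two-module idea: an observation network scores candidate clips, a sampling policy keeps a subset, and a clip-level classifier averages predictions over the kept clips, with the sampler updated REINFORCE-style using a 0/1 reward for a correct prediction. All names (`DynamicSamplingNetwork`, `observation_net`), the feature dimension, the number of kept clips, and the use of plain REINFORCE as the associative-RL update are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the DSN idea, not the authors' code.
# An observation network scores N candidate clips, a sampling policy keeps K,
# and a clip-level classifier averages predictions over the kept clips.
# The sampler gets reward 1 if the video is classified correctly, else 0.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSamplingNetwork(nn.Module):
    def __init__(self, feat_dim=512, num_classes=101, num_keep=2):
        super().__init__()
        self.num_keep = num_keep
        self.observation_net = nn.Linear(feat_dim, 1)        # lightweight clip scorer
        self.classifier = nn.Linear(feat_dim, num_classes)   # clip-level classifier

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim), precomputed clip features
        scores = self.observation_net(clip_feats).squeeze(-1)     # (batch, num_clips)
        probs = F.softmax(scores, dim=-1)
        # Sample K clips without replacement according to the learned policy.
        idx = torch.multinomial(probs, self.num_keep)             # (batch, K)
        log_prob = torch.log(probs.gather(1, idx) + 1e-8).sum(1)  # policy log-prob
        kept = clip_feats.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, clip_feats.size(-1)))
        logits = self.classifier(kept).mean(dim=1)                # average over kept clips
        return logits, log_prob

def training_step(model, clip_feats, labels, optimizer):
    logits, log_prob = model(clip_feats)
    cls_loss = F.cross_entropy(logits, labels)            # trains the classifier
    reward = (logits.argmax(1) == labels).float()         # 1 iff prediction correct
    policy_loss = -(reward.detach() * log_prob).mean()    # REINFORCE on the sampler
    optimizer.zero_grad()
    (cls_loss + policy_loss).backward()
    optimizer.step()

# Toy usage: 8 videos, 10 candidate clips each, 512-d features, UCF101-sized output.
model = DynamicSamplingNetwork()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
feats = torch.randn(8, 10, 512)
labels = torch.randint(0, 101, (8,))
training_step(model, feats, labels, opt)
```

At test time, only the kept clips would be run through the (expensive) backbone and classifier, which is where the "less than half of the clips" efficiency gain in the abstract comes from.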