Paper Title

Alignment-guided Temporal Attention for Video Action Recognition

Paper Authors

Yizhou Zhao, Zhenyang Li, Xun Guo, Yan Lu

Paper Abstract

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
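To make the two steps in the abstract concrete (parameter-free patch-level alignment between neighboring frames, followed by 1-D temporal attention), here is a minimal NumPy sketch. It is an illustrative simplification, not the authors' implementation: the function names, the dot-product matching criterion, and the omission of learned query/key/value projections and multi-head attention are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_to_previous(prev, curr):
    # prev, curr: (N, C) patch features of two neighboring frames.
    # Greedy, parameter-free matching: for every patch position in `prev`,
    # pick the most similar patch in `curr` (dot-product similarity) so the
    # temporal axis lines up position by position.
    sim = prev @ curr.T            # (N, N) similarity matrix
    idx = sim.argmax(axis=1)       # best-matching patch in `curr` per position
    return curr[idx]               # re-ordered (aligned) current frame

def alignment_guided_temporal_attention(frames):
    # frames: (T, N, C) -- T frames, N patches per frame, C channels.
    T, N, C = frames.shape
    # 1) Frame-by-frame alignment: chain the neighboring-frame alignments so
    #    every frame is expressed in the patch order of frame 0.
    aligned = [frames[0]]
    for t in range(1, T):
        aligned.append(align_to_previous(aligned[-1], frames[t]))
    aligned = np.stack(aligned)                 # (T, N, C)
    # 2) 1-D temporal self-attention, applied independently per patch position
    #    (no learned projections in this sketch).
    q = k = v = aligned.transpose(1, 0, 2)      # (N, T, C)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C), axis=-1)  # (N, T, T)
    out = attn @ v                              # (N, T, C)
    return out.transpose(1, 0, 2)               # back to (T, N, C)

# Toy usage: 4 frames, 16 patches per frame, 8 channels.
x = np.random.randn(4, 16, 8).astype(np.float32)
y = alignment_guided_temporal_attention(x)
print(y.shape)  # (4, 16, 8)
```

Because the alignment step only permutes existing patch features, it adds no parameters; in the paper's setting the temporal attention that follows would use the standard learned projections of the host image backbone, which this sketch leaves out.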
