Title
SMART Frame Selection for Action Recognition
Authors
Abstract
Action recognition is computationally expensive. In this paper, we address the problem of frame selection to improve the accuracy of action recognition. In particular, we show that selecting good frames helps action recognition performance even in the trimmed-video domain. Recent work has successfully leveraged frame selection for long, untrimmed videos, where much of the content is irrelevant and easy to discard. In this work, however, we focus on the more standard short, trimmed action recognition problem. We argue that good frame selection can not only reduce the computational cost of action recognition but also increase accuracy by discarding frames that are hard to classify. In contrast to previous work, we propose a method that considers frames jointly rather than one at a time. This results in a more efficient selection, where good frames are more effectively distributed over the video, like snapshots that tell a story. We call the proposed frame selection SMART, and we test it in combination with different backbone architectures and on multiple benchmarks (Kinetics, Something-Something, UCF101). We show that SMART frame selection consistently improves accuracy compared to other frame selection strategies while reducing the computational cost by a factor of 4 to 10. Additionally, we show that when the primary goal is recognition performance, our selection strategy improves over recent state-of-the-art models and frame selection strategies on various benchmarks (UCF101, HMDB51, FCVID, and ActivityNet).
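The contrast between per-frame and joint selection can be made concrete with a toy sketch. This is not the SMART model itself (which learns the selection); it is a hypothetical illustration assuming precomputed per-frame relevance scores and a simple temporal-spread bonus, showing how joint selection distributes good frames over the video rather than clustering them.

```python
# Toy illustration of joint vs. independent frame selection.
# NOT the SMART method from the paper: a hypothetical sketch assuming
# per-frame relevance scores and a hand-made temporal-spread bonus.

def select_independent(scores, k):
    """Pick the k highest-scoring frames, each considered on its own."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def select_joint(scores, k, spread_weight=0.5):
    """Greedy joint selection: each candidate is scored by its own
    relevance plus its distance to already-chosen frames, so good
    frames end up spread over the video like storytelling snapshots."""
    chosen = []
    n = len(scores)
    for _ in range(k):
        best, best_val = None, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            # Normalized distance to the nearest already-chosen frame.
            spread = min((abs(i - j) for j in chosen), default=n) / n
            val = scores[i] + spread_weight * spread
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return sorted(chosen)

# Relevance concentrated near the start of an 8-frame clip.
scores = [0.9, 0.85, 0.8, 0.1, 0.6, 0.1, 0.1, 0.7]
print(select_independent(scores, 3))  # -> [0, 1, 2]: clustered at the start
print(select_joint(scores, 3))        # -> [0, 2, 7]: spread over the video
```

Independent top-k picks three adjacent, near-duplicate frames, while the joint criterion trades a little per-frame score for coverage of the whole clip, which is the intuition behind considering frames jointly.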