Paper Title

Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition

Paper Authors

Divya Kothandaraman, Ming Lin, Dinesh Manocha

Paper Abstract

We present a learning algorithm for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras that capture a human actor along with background motion. Typically, the human actor occupies less than one-tenth of the spatial resolution. Our approach simultaneously harnesses the benefits of frequency-domain representations, a classical analysis tool in signal processing, and data-driven neural networks. We build a differentiable static-dynamic frequency mask prior that models the salient static and dynamic pixels in the video, which is crucial for the underlying task of action recognition. Using this differentiable mask prior, the neural network intrinsically learns disentangled feature representations via an identity loss function; our formulation thus empowers the network to inherently compute disentangled salient features within its layers. Further, we propose a cost function encapsulating temporal relevance and spatial content to sample the most important frame within uniformly spaced video segments. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset and demonstrate relative improvements of 5.72%-13.00% over the state-of-the-art and 14.28%-38.05% over the corresponding baseline model.
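To make the core idea concrete, the following is a minimal, hypothetical PyTorch sketch of a differentiable temporal-frequency mask that splits a clip into static and dynamic streams, with an identity-style loss tying the two streams back to the input. All names (`FrequencyDisentangler`, `identity_loss`, the stub feature extractor) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a differentiable static-dynamic frequency mask.
# This is NOT the authors' implementation; all names and shapes are assumed.
import torch
import torch.nn as nn

class FrequencyDisentangler(nn.Module):
    """Learnable soft mask over temporal-frequency bins (via torch.fft)."""

    def __init__(self, num_frames: int):
        super().__init__()
        # One logit per temporal-frequency bin; sigmoid keeps the mask in
        # (0, 1), so gradients flow through it (the "differentiable" part).
        self.gate = nn.Parameter(torch.zeros(num_frames))

    def forward(self, clip: torch.Tensor):
        # clip: (batch, time, channels, height, width)
        spec = torch.fft.fft(clip, dim=1)                        # temporal FFT
        mask = torch.sigmoid(self.gate).view(1, -1, 1, 1, 1)     # broadcast
        static = torch.fft.ifft(spec * mask, dim=1).real         # masked band
        dynamic = torch.fft.ifft(spec * (1 - mask), dim=1).real  # complement
        return static, dynamic

# Stub nonlinear feature extractor; Conv3d expects (batch, C, T, H, W).
features = nn.Sequential(nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU())

def identity_loss(clip, static, dynamic):
    # Hedged reading of the abstract's identity loss: features of the two
    # disentangled streams should compose back to the input's features.
    f = lambda x: features(x.transpose(1, 2))
    return torch.mean((f(static) + f(dynamic) - f(clip)) ** 2)

clip = torch.randn(2, 16, 3, 56, 56)
model = FrequencyDisentangler(num_frames=16)
s, d = model(clip)
identity_loss(clip, s, d).backward()  # gradients reach the mask logits
```

Because the mask lives inside the computation graph, its gradients arrive through the FFT/IFFT pair, which is the property the abstract's "differentiable prior" depends on.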
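The frame-sampling step can be sketched in the same hedged spirit: split the clip into uniformly spaced segments and, within each, score frames with a cost mixing spatial content and temporal relevance. The specific terms below (spatial-gradient energy and frame differencing) are stand-ins for whatever the paper's cost function actually uses.

```python
# Hypothetical sketch of cost-based frame sampling from uniform segments.
# The spatial/temporal score terms are illustrative, not the paper's exact cost.
import torch

def sample_frames(clip: torch.Tensor, num_segments: int = 8):
    """clip: (time, channels, height, width) -> indices of chosen frames."""
    t = clip.shape[0]
    # Spatial content: mean absolute horizontal gradient per frame.
    spatial = (clip[:, :, :, 1:] - clip[:, :, :, :-1]).abs().mean(dim=(1, 2, 3))
    # Temporal relevance: mean absolute difference from the previous frame.
    temporal = torch.zeros(t)
    temporal[1:] = (clip[1:] - clip[:-1]).abs().mean(dim=(1, 2, 3))
    cost = spatial + temporal  # equal weighting, an assumption
    # Pick the highest-cost frame within each uniformly spaced segment.
    bounds = torch.linspace(0, t, num_segments + 1).long()
    picks = []
    for i in range(num_segments):
        lo, hi = bounds[i].item(), bounds[i + 1].item()
        picks.append(lo + int(cost[lo:hi].argmax()))
    return picks

video = torch.randn(64, 3, 56, 56)
print(sample_frames(video))  # one frame index per 8-frame segment
```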
