Paper Title
Deep Multimodal Feature Encoding for Video Ordering
Paper Authors
Paper Abstract
True understanding of a video comes from a joint analysis of all of its modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all of these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the "Large Scale Movie Description Challenge". We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the performance of many applications.
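To make the temporal-ordering proxy task concrete, below is a minimal PyTorch sketch of the general idea: project per-clip visual, audio, and text features into one compact embedding, then train a pairwise head to predict whether one clip precedes another. All module names, feature dimensions, and the late-fusion and pairwise-classification design here are illustrative assumptions for exposition; the paper's actual architecture and training setup are not reproduced from this abstract.

```python
import torch
import torch.nn as nn

class MultimodalClipEncoder(nn.Module):
    """Fuses precomputed per-clip visual, audio, and text features into
    one compact embedding. Dimensions are hypothetical placeholders."""

    def __init__(self, dim_visual=2048, dim_audio=128, dim_text=300, dim_embed=512):
        super().__init__()
        self.proj = nn.ModuleDict({
            "visual": nn.Linear(dim_visual, dim_embed),
            "audio": nn.Linear(dim_audio, dim_embed),
            "text": nn.Linear(dim_text, dim_embed),
        })
        # Simple late fusion: average the projected modalities, then a small MLP.
        self.fuse = nn.Sequential(nn.ReLU(), nn.Linear(dim_embed, dim_embed))

    def forward(self, visual, audio, text):
        z = (self.proj["visual"](visual)
             + self.proj["audio"](audio)
             + self.proj["text"](text)) / 3.0
        return self.fuse(z)

class PairwiseOrderingHead(nn.Module):
    """Scores whether clip a temporally precedes clip b; training this
    proxy task provides the supervision for the shared encoder."""

    def __init__(self, dim_embed=512):
        super().__init__()
        self.classifier = nn.Linear(2 * dim_embed, 1)

    def forward(self, emb_a, emb_b):
        return self.classifier(torch.cat([emb_a, emb_b], dim=-1))

# Toy usage on random features for a batch of clip pairs.
enc, head = MultimodalClipEncoder(), PairwiseOrderingHead()
v, a, t = torch.randn(4, 2048), torch.randn(4, 128), torch.randn(4, 300)
emb = enc(v, a, t)
logits = head(emb[:2], emb[2:])  # does clip i precede clip j?
loss = nn.functional.binary_cross_entropy_with_logits(
    logits.squeeze(-1), torch.tensor([1.0, 0.0]))
loss.backward()
```

A full ordering of 2-6 clips per scene can then be recovered from such pairwise scores, e.g. by sorting clips by how often they are predicted to come first; this aggregation step is likewise only one plausible reading of the proxy task described above.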