Paper Title
Video Moment Localization using Object Evidence and Reverse Captioning
Paper Authors
Paper Abstract
We address the problem of language-based temporal localization of moments in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging because language-based queries have no predefined activity classes and may also contain complex descriptions. The current state-of-the-art model, MAC, addresses it by mining activity concepts from both the video and language modalities: it encodes semantic activity concepts from the verb/object pair in a language query and leverages visual activity concepts from video activity classification prediction scores. We propose the "Multi-faceted Video Moment Localizer" (MML), an extension of the MAC model that introduces visual object evidence via object segmentation masks and video understanding features via video captioning. Furthermore, we improve the language modelling in the sentence embedding. We experimented on the Charades-STA dataset and found that MML outperforms the MAC baseline by 4.93% and 1.70% on the R@1 and R@5 metrics, respectively. Our code and pre-trained model are publicly available at https://github.com/madhawav/MML.
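The abstract describes fusing several feature streams: per-moment visual features, object evidence derived from segmentation masks, caption-based video understanding features, and a sentence embedding of the query. The following is a minimal sketch (not the authors' implementation; all module names, dimensions, and the fusion scheme are assumptions) of how such streams could be projected into a shared space and scored against a candidate moment.

```python
# Hypothetical sketch: fuse moment-level visual, object-evidence, and caption
# features with a query sentence embedding to produce one alignment score.
import torch
import torch.nn as nn


class MomentScorer(nn.Module):
    def __init__(self, vis_dim=500, obj_dim=256, cap_dim=512, query_dim=300, hidden=256):
        super().__init__()
        # Project each modality into a shared hidden space before fusion.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.obj_proj = nn.Linear(obj_dim, hidden)
        self.cap_proj = nn.Linear(cap_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        # Score head over the concatenated (fused video, query) representation.
        self.score = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, vis_feat, obj_feat, cap_feat, query_emb):
        # vis_feat / obj_feat / cap_feat: (batch, dim) features pooled over a candidate moment.
        # query_emb: (batch, query_dim) sentence embedding of the language query.
        video = torch.relu(
            self.vis_proj(vis_feat) + self.obj_proj(obj_feat) + self.cap_proj(cap_feat)
        )
        query = torch.relu(self.query_proj(query_emb))
        return self.score(torch.cat([video, query], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    scorer = MomentScorer()
    batch = 4
    scores = scorer(
        torch.randn(batch, 500), torch.randn(batch, 256),
        torch.randn(batch, 512), torch.randn(batch, 300),
    )
    print(scores.shape)  # torch.Size([4]) -- one alignment score per candidate moment
```

In practice, the actual MML model and feature extractors differ; see the repository linked in the abstract for the released code and pre-trained model.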