Paper Title

Structured Two-stream Attention Network for Video Question Answering

Authors

Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, Heng Tao Shen

Abstract

To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both the spatial and long-range temporal structures of a video, as well as the text, to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network (STA) to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representations and infers the answer. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0%, and 0.3 for the Action, Trans., FrameQA, and Count tasks, respectively. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
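The abstract only sketches the two-stream attention component at a high level. Below is a minimal, hypothetical PyTorch sketch of what such a cross-modal block could look like: the class name, projection layers, and max-pooled affinity attention are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttention(nn.Module):
    """Hypothetical sketch: question-guided attention over video segment
    features and video-guided attention over question word features,
    followed by a simple fusion of the two context vectors."""
    def __init__(self, dim):
        super().__init__()
        self.vid_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vid, txt):
        # vid: (B, Nv, D) segment-level video features
        # txt: (B, Nt, D) word-level question features
        # Cross-modal affinity between every segment and every word.
        affinity = torch.bmm(self.vid_proj(vid),
                             self.txt_proj(txt).transpose(1, 2))   # (B, Nv, Nt)
        # Video stream: weight each segment by its best-matching word,
        # suppressing background segments irrelevant to the question.
        vid_att = F.softmax(affinity.max(dim=2).values, dim=1)     # (B, Nv)
        vid_ctx = torch.bmm(vid_att.unsqueeze(1), vid).squeeze(1)  # (B, D)
        # Text stream: weight each word by its best-matching segment,
        # focusing on the question words relevant to the video.
        txt_att = F.softmax(affinity.max(dim=1).values, dim=1)     # (B, Nt)
        txt_ctx = torch.bmm(txt_att.unsqueeze(1), txt).squeeze(1)  # (B, D)
        # Fuse the query-aware and video-aware context representations.
        return self.fuse(torch.cat([vid_ctx, txt_ctx], dim=-1))    # (B, D)
```

In a full pipeline, one such block would run per video segment produced by the structured segment component, and the fused outputs would feed an answer decoder; those surrounding pieces are omitted here.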
