Paper Title
From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
Paper Authors
Paper Abstract
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to selectively target different visual areas that are related to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the nature of the attention mechanism, because CNN features are naturally both spatial and channel-wise. Also, visual attention is usually conducted at the pixel level, which may cause region discontinuity problems. In this paper, we propose a Cubic Visual Attention (CVA) model that improves the VQA task by applying novel channel and spatial attention to object regions. Specifically, instead of attending to pixels, we first take advantage of object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we use the question to guide the computation of channel attention and spatial attention over the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We evaluate the performance of the proposed CVA on three public image QA datasets: COCO-QA, VQA, and Visual7W. Experimental results show that our method significantly outperforms state-of-the-art approaches.
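To make the pipeline in the abstract concrete, below is a minimal PyTorch-style sketch of question-guided channel and spatial attention over object-region features. The class name CubicVisualAttention, the layer sizes, and the fusion operators (tanh-add gating, sigmoid channel weights, softmax region weights) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CubicVisualAttention(nn.Module):
    """Sketch: question-guided channel attention followed by spatial
    attention over K object-proposal features (shapes are assumptions)."""
    def __init__(self, feat_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        # Channel attention: the question scores each of the C channels.
        self.ch_q = nn.Linear(q_dim, hidden)
        self.ch_v = nn.Linear(feat_dim, hidden)
        self.ch_score = nn.Linear(hidden, feat_dim)
        # Spatial attention: the question scores each of the K regions.
        self.sp_q = nn.Linear(q_dim, hidden)
        self.sp_v = nn.Linear(feat_dim, hidden)
        self.sp_score = nn.Linear(hidden, 1)

    def forward(self, v, q):
        # v: (B, K, C) conv features of K object proposals; q: (B, q_dim)
        # --- channel attention ---
        v_mean = v.mean(dim=1)                                 # (B, C) pool over regions
        ch = torch.tanh(self.ch_q(q) + self.ch_v(v_mean))      # (B, hidden)
        ch_w = torch.sigmoid(self.ch_score(ch)).unsqueeze(1)   # (B, 1, C)
        v = v * ch_w                                           # reweight channels
        # --- spatial attention ---
        sp = torch.tanh(self.sp_q(q).unsqueeze(1) + self.sp_v(v))  # (B, K, hidden)
        sp_w = F.softmax(self.sp_score(sp), dim=1)             # (B, K, 1)
        return (v * sp_w).sum(dim=1)                           # (B, C) attended feature

# Usage with dummy tensors (36 proposals per image is a common choice):
cva = CubicVisualAttention()
v = torch.randn(8, 36, 2048)   # object-proposal conv features
q = torch.randn(8, 1024)       # encoded question
attended = cva(v, q)           # (8, 2048), fused with q to predict the answer
```

Applying channel attention before spatial attention mirrors the "cubic" idea in the abstract: the question first selects which feature channels matter, then selects which object regions matter, rather than sharing one weight across all channels as planar attention does.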