Paper Title

Graph Structured Network for Image-Text Matching

Authors

Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang

Abstract

Image-text matching has received growing interest since it bridges vision and language. The key challenge lies in how to learn correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, while failing to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. The GSMN explicitly models object, relation and attribute as a structured phrase, which not only allows to learn correspondence of object, relation and attribute separately, but also benefits to learn fine-grained correspondence of structured phrase. This is achieved by node-level matching and structure-level matching. The node-level matching associates each node with its relevant nodes from another modality, where the node can be object, relation or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN.
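The two matching stages described above can be illustrated with a minimal sketch. This is not the authors' GSMN implementation; the function names, the cosine-similarity attention, and the simple mean aggregation over graph neighbors are all illustrative assumptions standing in for the paper's learned components.

```python
import numpy as np

def node_level_matching(text_nodes, image_nodes):
    """Associate each text node (object/relation/attribute) with image nodes.

    Illustrative assumption: cross-modal association is modeled as a
    softmax over cosine similarities, yielding an attended image
    representation for every text node.
    """
    t = text_nodes / np.linalg.norm(text_nodes, axis=1, keepdims=True)
    v = image_nodes / np.linalg.norm(image_nodes, axis=1, keepdims=True)
    sim = t @ v.T                                   # (n_text, n_image)
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return attn @ image_nodes                       # one matched vector per text node

def structure_level_matching(matched, adjacency):
    """Fuse each node's association with those of its graph neighbors.

    Illustrative assumption: one mean-aggregation step over the phrase
    graph's adjacency matrix stands in for the paper's learned
    neighborhood fusion.
    """
    degree = adjacency.sum(axis=1, keepdims=True) + 1e-8
    return (adjacency @ matched) / degree

# Toy example: 3 text nodes, 5 image regions, 4-d embeddings.
text_nodes = np.random.rand(3, 4)
image_nodes = np.random.rand(5, 4)
matched = node_level_matching(text_nodes, image_nodes)
adjacency = np.ones((3, 3)) - np.eye(3)             # fully connected phrase graph
fused = structure_level_matching(matched, adjacency)
```

The point of the second stage is that a node's correspondence is no longer judged in isolation: a relation node, for example, is scored together with the object and attribute nodes it connects to, which is what enables fine-grained structured-phrase correspondence.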
