Paper Title

Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval

Paper Authors

Yan Gong, Georgina Cosma

Paper Abstract

Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained by adopting a hard negatives loss function which learns an objective margin between the similarity of relevant and irrelevant image-description embedding pairs. However, the objective margin in the hard negatives loss function is set as a fixed hyperparameter that ignores the semantic differences of the irrelevant image-description pairs. To address the challenge of measuring the optimal similarities between image-description pairs before obtaining the trained VSE networks, this paper presents a novel approach that comprises two main parts: (1) finding the underlying semantics of image descriptions; and (2) proposing a novel semantically-enhanced hard negatives loss function, where the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image-description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks that were applied to three benchmark datasets for cross-modal information retrieval tasks. The results revealed that the proposed methods achieved the best performance and can also be adopted by existing and future VSE networks.
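
For context, below is a minimal sketch of the fixed-margin hard negatives (max-of-hinges) loss that the abstract describes as the baseline. It is illustrative only, not the authors' implementation: the function name, tensor shapes, and the choice of PyTorch are assumptions, and the paper's contribution replaces the fixed `margin` hyperparameter with a learning objective determined dynamically from the semantics of the irrelevant pairs.

```python
import torch

def hard_negatives_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Fixed-margin max-of-hinges triplet loss over a batch (illustrative sketch).

    sim: (N, N) similarity matrix where sim[i, i] is the score of the i-th
    relevant (matched) image-description pair and off-diagonal entries are
    irrelevant pairs.
    """
    n = sim.size(0)
    positives = sim.diag().view(n, 1)                      # s(i, c) for matched pairs
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Hinge cost for every irrelevant pair; zero out the matched (diagonal) pairs.
    cost_txt = (margin + sim - positives).clamp(min=0).masked_fill(mask, 0.0)
    cost_img = (margin + sim - positives.t()).clamp(min=0).masked_fill(mask, 0.0)

    # Keep only the hardest negative per image (row) and per description (column).
    return cost_txt.max(dim=1).values.sum() + cost_img.max(dim=0).values.sum()
```

Here `sim` would typically be the cosine-similarity matrix between L2-normalized image and description embeddings, e.g. `sim = img_emb @ txt_emb.t()`. Making the learning objective per-pair and semantics-aware, rather than the single constant `margin` above, is precisely what the proposed semantically-enhanced loss changes.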
