Paper Title
Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval
Paper Authors
Paper Abstract
This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a single embedding space and searches for candidates across different modalities. To learn this unified embedding space, UniVL-DR introduces two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art performance on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval baselines on both subtasks, text-text retrieval and text-image retrieval. These results demonstrate that universal multi-modal search is feasible, can replace the divide-and-conquer pipeline with one unified model, and also benefits single- and cross-modality tasks. All source code for this work is available at https://github.com/OpenMatch/UniVL-DR.
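For intuition on the first technique, below is a minimal sketch, not the released UniVL-DR code, of a contrastive objective that scores each query against one positive candidate plus hard negatives drawn in equal numbers from the text and image modalities. The function name, tensor shapes, number of negatives per modality, and temperature value are illustrative assumptions.

```python
# Minimal sketch of a modality-balanced hard-negative contrastive loss.
# Assumptions: embeddings are precomputed, K hard negatives per modality per query,
# and the positive candidate is placed at index 0 of each query's candidate pool.
import torch
import torch.nn.functional as F


def modality_balanced_contrastive_loss(
    q_emb: torch.Tensor,           # [B, D] query embeddings
    pos_emb: torch.Tensor,         # [B, D] positive candidate embeddings (text or image)
    hard_text_emb: torch.Tensor,   # [B, K, D] hard negative text embeddings
    hard_image_emb: torch.Tensor,  # [B, K, D] hard negative image embeddings
    temperature: float = 0.01,     # illustrative value
) -> torch.Tensor:
    """Contrast each query against its positive and an equal number of
    text and image hard negatives in one shared embedding space."""
    B, _ = q_emb.shape
    # Balance the negative pool across modalities, then prepend the positive.
    negatives = torch.cat([hard_text_emb, hard_image_emb], dim=1)       # [B, 2K, D]
    candidates = torch.cat([pos_emb.unsqueeze(1), negatives], dim=1)    # [B, 1+2K, D]

    # Cosine similarity between each query and its own candidate pool.
    q = F.normalize(q_emb, dim=-1).unsqueeze(1)   # [B, 1, D]
    c = F.normalize(candidates, dim=-1)           # [B, 1+2K, D]
    logits = (q * c).sum(-1) / temperature        # [B, 1+2K]

    # The positive sits at index 0 for every query.
    labels = torch.zeros(B, dtype=torch.long, device=q_emb.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, K, D = 4, 2, 768
    loss = modality_balanced_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D),
        torch.randn(B, K, D), torch.randn(B, K, D),
    )
    print(loss.item())
```

The design choice the sketch illustrates is that sampling hard negatives evenly from both modalities keeps the shared space from collapsing toward whichever modality dominates the training batches; the released repository should be consulted for the authors' actual training setup.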