Paper Title
Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval
Paper Authors
Paper Abstract
This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a single embedding space and searches for candidates across different modalities. To learn this unified embedding space, UniVL-DR introduces two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art performance on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval baselines on both subtasks, text-text retrieval and text-image retrieval. These results demonstrate that universal multi-modal search is feasible, can replace the divide-and-conquer pipeline with one unified model, and also benefits single- and cross-modality tasks. All source code for this work is available at https://github.com/OpenMatch/UniVL-DR.
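For intuition on the first technique, below is a minimal sketch, not the released UniVL-DR code, of a contrastive objective that scores each query against one positive candidate plus hard negatives drawn in equal numbers from the text and image modalities. The function name, tensor shapes, number of negatives per modality, and temperature value are illustrative assumptions.

```python
# Minimal sketch of a modality-balanced hard-negative contrastive loss.
# Assumptions: embeddings are precomputed, K hard negatives per modality per query,
# and the positive candidate is placed at index 0 of each query's candidate pool.
import torch
import torch.nn.functional as F


def modality_balanced_contrastive_loss(
    q_emb: torch.Tensor,           # [B, D] query embeddings
    pos_emb: torch.Tensor,         # [B, D] positive candidate embeddings (text or image)
    hard_text_emb: torch.Tensor,   # [B, K, D] hard negative text embeddings
    hard_image_emb: torch.Tensor,  # [B, K, D] hard negative image embeddings
    temperature: float = 0.01,     # illustrative value
) -> torch.Tensor:
    """Contrast each query against its positive and an equal number of
    text and image hard negatives in one shared embedding space."""
    B, _ = q_emb.shape
    # Balance the negative pool across modalities, then prepend the positive.
    negatives = torch.cat([hard_text_emb, hard_image_emb], dim=1)       # [B, 2K, D]
    candidates = torch.cat([pos_emb.unsqueeze(1), negatives], dim=1)    # [B, 1+2K, D]

    # Cosine similarity between each query and its own candidate pool.
    q = F.normalize(q_emb, dim=-1).unsqueeze(1)   # [B, 1, D]
    c = F.normalize(candidates, dim=-1)           # [B, 1+2K, D]
    logits = (q * c).sum(-1) / temperature        # [B, 1+2K]

    # The positive sits at index 0 for every query.
    labels = torch.zeros(B, dtype=torch.long, device=q_emb.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, K, D = 4, 2, 768
    loss = modality_balanced_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D),
        torch.randn(B, K, D), torch.randn(B, K, D),
    )
    print(loss.item())
```

The design choice the sketch illustrates is that sampling hard negatives evenly from both modalities keeps the shared space from collapsing toward whichever modality dominates the training batches; the released repository should be consulted for the authors' actual training setup.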