Paper Title
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Paper Authors
Paper Abstract
Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g. image pixels), thus producing semantically rich multimodal representations that perform well on both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
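The abstract describes the core objective at a high level: joint masking of image patches and word tokens, with an implicit target in which the model regresses latent multimodal representations produced from the unmasked input. The following is a minimal, hypothetical PyTorch sketch of such a student/momentum-teacher objective; the module name, the assumed fusion_encoder interface, and hyperparameters such as momentum and mask_ratio are illustrative assumptions rather than the authors' implementation, and the explicit targets (momentum visual features of image patches and word-token concepts) are omitted for brevity.

# Hypothetical sketch of a jointly masked multimodal objective in the spirit of MAMO.
# The fusion_encoder interface, names, and hyperparameters are assumptions, not the paper's code.
import copy
import torch
import torch.nn.functional as F


def random_mask(x, mask_ratio):
    """Randomly drop a fraction of token embeddings (simplified joint masking)."""
    keep = (torch.rand(x.shape[:2], device=x.device) > mask_ratio).unsqueeze(-1)
    return x * keep, keep


class MaskedMultimodalSketch(torch.nn.Module):
    def __init__(self, fusion_encoder, momentum=0.995, mask_ratio=0.3):
        super().__init__()
        # fusion_encoder is assumed to map (image_embeds, text_embeds) -> (B, N_img+N_txt, D).
        self.student = fusion_encoder                 # online multimodal encoder
        self.teacher = copy.deepcopy(fusion_encoder)  # momentum (EMA) encoder
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.momentum = momentum
        self.mask_ratio = mask_ratio

    @torch.no_grad()
    def _ema_update(self):
        # Exponential moving average of student weights into the teacher.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.data.mul_(self.momentum).add_(ps.data, alpha=1 - self.momentum)

    def forward(self, image_embeds, text_embeds):
        # Joint masking over both modalities.
        img_masked, img_keep = random_mask(image_embeds, self.mask_ratio)
        txt_masked, txt_keep = random_mask(text_embeds, self.mask_ratio)

        # Student sees the masked pair; the momentum teacher sees the unmasked pair.
        student_out = self.student(img_masked, txt_masked)
        with torch.no_grad():
            target = self.teacher(image_embeds, text_embeds)

        # Implicit objective: regress latent multimodal representations at masked positions.
        masked_pos = ~torch.cat([img_keep, txt_keep], dim=1)
        loss = F.smooth_l1_loss(student_out[masked_pos.squeeze(-1)],
                                target[masked_pos.squeeze(-1)])

        self._ema_update()
        return loss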