Paper Title
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Paper Authors
Paper Abstract
We present an effective method for fusing vision and language representations for several question answering tasks, including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing along the channel dimension, the model aligns the tokens more effectively than standard methods. These multimodal representations, which we call compound tokens, are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention; we then concatenate the vision tokens and the retrieved text tokens along the channel dimension. A second group of compound tokens is generated using an analogous process in which the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with a multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks, including GQA, VQA2.0, and SNLI-VE.
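The fusion procedure in the abstract lends itself to a compact implementation. Below is a minimal PyTorch sketch of the two cross-attention directions followed by channel-wise concatenation; it is not the authors' released code, and the module name CompoundTokenFusion, the head count, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): compound-token fusion
# via two cross-attention directions plus channel-wise concatenation.
import torch
import torch.nn as nn


class CompoundTokenFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention layer per direction (vision->text, text->vision).
        self.v2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # vision: (B, Nv, dim), text: (B, Nt, dim)
        # 1) Vision tokens act as queries and retrieve compatible text tokens.
        retrieved_text, _ = self.v2t(query=vision, key=text, value=text)
        # 2) Channel fusion: concatenate along the channel (last) dimension.
        vision_compound = torch.cat([vision, retrieved_text], dim=-1)  # (B, Nv, 2*dim)
        # 3) Analogous process with text tokens as the queries.
        retrieved_vision, _ = self.t2v(query=text, key=vision, value=vision)
        text_compound = torch.cat([text, retrieved_vision], dim=-1)    # (B, Nt, 2*dim)
        # 4) Stack both groups along the token axis for the multimodal encoder.
        return torch.cat([vision_compound, text_compound], dim=1)      # (B, Nv+Nt, 2*dim)


# Usage with illustrative shapes: 196 image-patch tokens, 32 text tokens.
fusion = CompoundTokenFusion(dim=512)
compound = fusion(torch.randn(2, 196, 512), torch.randn(2, 32, 512))
print(compound.shape)  # torch.Size([2, 228, 1024])
```

Note that fusing on channels doubles the embedding width, so in this sketch the downstream multimodal encoder would need to accept 2*dim channels or project the compound tokens back to dim.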