Paper Title

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Paper Authors

Oscar Mañas, Pau Rodriguez, Saba Ahmadi, Aida Nematzadeh, Yash Goyal, Aishwarya Agrawal

Paper Abstract

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL's modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/mair-lab/mapl.
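
The abstract's core idea, a small trainable mapper bridging two frozen unimodal models, can be sketched as follows. This is a minimal PyTorch-style illustration, not the released MAPL implementation (see the linked repository for the actual code): the module name `Mapper`, the cross-attention design, and all dimensions (`vis_dim`, `lm_dim`, `num_prompt_tokens`) are hypothetical choices for exposition.

```python
import torch
import torch.nn as nn

class Mapper(nn.Module):
    """Hypothetical lightweight mapper (illustration only, not the released MAPL code).

    Maps patch features from a frozen vision encoder into a short sequence of
    "visual prompt" embeddings in a frozen language model's input space. Only
    this module is trained; both pre-trained unimodal models stay frozen.
    """

    def __init__(self, vis_dim=768, lm_dim=4096, hidden_dim=256, num_prompt_tokens=32):
        super().__init__()
        # Learnable query tokens that will be turned into the visual prompt.
        self.queries = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)  # down-project image features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.lm_proj = nn.Linear(hidden_dim, lm_dim)    # up-project into the LM's embedding space

    def forward(self, image_feats):  # image_feats: (batch, num_patches, vis_dim)
        kv = self.vis_proj(image_feats)                            # (batch, num_patches, hidden_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)   # (batch, num_prompt_tokens, hidden_dim)
        out, _ = self.attn(q, kv, kv)                              # queries cross-attend to image patches
        return self.lm_proj(out)                                   # (batch, num_prompt_tokens, lm_dim)

# Example: 197 ViT-B/16 tokens (196 patches + CLS) -> 32 prompt vectors for a 4096-dim LM.
mapper = Mapper()
visual_prompt = mapper(torch.randn(2, 197, 768))  # -> shape (2, 32, 4096)
```

In training, the mapper's output sequence would be prepended to the embedded caption tokens and optimized with the frozen language model's next-token prediction loss; at inference, few-shot prompts interleave such visual prompts with in-context text examples.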
