Paper Title

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Paper Authors

Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo

Paper Abstract

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to the conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) in pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), TAP effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve the performance, we build a large-scale dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million scene text-related image-text pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
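The abstract names three pre-training objectives (MLM, ITM, and RPP) computed over a joint representation of text words, visual objects, and scene text. Below is a minimal, hedged sketch of how such a multi-task head could be wired up in PyTorch; the module, tensor names, and the number of spatial-relation classes are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of TAP-style pre-training losses (MLM + ITM + RPP).
# Assumes a generic multimodal transformer that already produces fused token
# states for text words, visual objects, and scene-text (OCR) tokens.
import torch
import torch.nn as nn

class TAPStylePretrainingHead(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522, num_spatial_relations=12):
        super().__init__()
        # Masked language modeling over text-word and scene-text tokens
        self.mlm_head = nn.Linear(hidden_size, vocab_size)
        # Image-text matching: binary "does this text match the image?"
        self.itm_head = nn.Linear(hidden_size, 2)
        # Relative (spatial) position prediction for an (object, OCR token) pair;
        # the relation vocabulary size here is a placeholder assumption.
        self.rpp_head = nn.Linear(2 * hidden_size, num_spatial_relations)

    def forward(self, fused_states, mlm_labels, itm_labels,
                obj_idx, ocr_idx, rpp_labels):
        # fused_states: (B, L, H) joint encoding of all three modalities
        batch = torch.arange(fused_states.size(0))
        mlm_logits = self.mlm_head(fused_states)            # (B, L, V)
        itm_logits = self.itm_head(fused_states[:, 0])      # [CLS]-like token
        pair = torch.cat([fused_states[batch, obj_idx],     # one object token
                          fused_states[batch, ocr_idx]],    # one OCR token
                         dim=-1)
        rpp_logits = self.rpp_head(pair)                    # (B, num_relations)

        ce = nn.CrossEntropyLoss(ignore_index=-100)          # -100 marks unmasked tokens
        loss_mlm = ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
        loss_itm = ce(itm_logits, itm_labels)
        loss_rpp = ce(rpp_logits, rpp_labels)
        return loss_mlm + loss_itm + loss_rpp
```

In this reading, the three losses are simply summed over the fused sequence, which is one plausible way to realize the "better aligned representation among the three modalities" the abstract describes.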
