论文标题
释义是新颖的对象字幕所需的全部
Paraphrasing Is All You Need for Novel Object Captioning
论文作者
论文摘要
新颖的对象字幕(NOC)旨在描述包含对象的图像,而无需在训练过程中观察其地面真相标题。由于缺乏字幕注释,无法通过序列到序列训练或苹果酒优化直接优化字幕模型。结果,我们提出了启用释义(P2C),这是一个针对NOC的两阶段学习框架,它将通过释义通过释义优化输出字幕。使用P2C,字幕模型首先从预先培训的仅在文本语料库上进行训练的语言模型学习,从而扩展了Bank一词以提高语言流利性。为了进一步实施足够描述输入图像的视觉内容的输出字幕,我们对引入的忠诚度和充分性目标进行字幕模型执行自我贴形。由于在训练过程中没有任何地面真相标题可用于新颖的对象图像,因此我们的P2C杠杆化(图像文本)关联模块可确保可以正确保留上述字幕特征。在实验中,我们不仅表明我们的P2C在NOCAPS和可可字幕数据集上实现了最先进的性能,还通过替换NOC的语言和跨模式关联模型来验证学习框架的有效性和灵活性。补充材料中提供实施详细信息和代码。
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training. Due to the absence of caption annotation, captioning models cannot be directly optimized via sequence-to-sequence training or CIDEr optimization. As a result, we present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which would heuristically optimize the output captions via paraphrasing. With P2C, the captioning model first learns paraphrasing from a language model pre-trained on text-only corpus, allowing expansion of the word bank for improving linguistic fluency. To further enforce the output caption sufficiently describing the visual content of the input image, we perform self-paraphrasing for the captioning model with fidelity and adequacy objectives introduced. Since no ground truth captions are available for novel object images during training, our P2C leverages cross-modality (image-text) association modules to ensure the above caption characteristics can be properly preserved. In the experiments, we not only show that our P2C achieves state-of-the-art performances on nocaps and COCO Caption datasets, we also verify the effectiveness and flexibility of our learning framework by replacing language and cross-modality association models for NOC. Implementation details and code are available in the supplementary materials.