Paper Title

VASR: Visual Analogies of Situation Recognition

Authors

Yonatan Bitton, Ron Yosef, Eli Strugo, Dafna Shahaf, Roy Schwartz, Gabriel Stanovsky

Abstract


A core process in human cognition is analogical mapping: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820 validated analogies. Our experiments demonstrate that state-of-the-art models do well when distractors are chosen randomly (~86%), but struggle with carefully chosen distractors (~53%, compared to 90% human accuracy). We hope our dataset will encourage the development of new analogy-making models. Website: https://vasr-dataset.github.io/
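To make the task format concrete, below is a minimal sketch of a zero-shot baseline in the spirit of the classic word-analogy vector arithmetic the abstract adapts to the visual domain: embed each image with CLIP and pick the candidate whose embedding is closest to B + (A' - A). This is an illustrative assumption, not the authors' dataset-generation pipeline or any of the evaluated models; the checkpoint name is a common public CLIP release and the file paths are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical baseline: solve A : A' :: B : ? by vector arithmetic
# over CLIP image embeddings (the visual analogue of word analogies).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def solve_analogy(a, a_prime, b, candidates):
    """Pick the index of the candidate image that best completes the analogy."""
    base = embed([a, a_prime, b])
    target = base[1] - base[0] + base[2]   # A' - A + B
    cand = embed(candidates)
    scores = cand @ target                 # dot product ranks candidates
    return int(scores.argmax())

# Placeholder file names for illustration only:
# idx = solve_analogy("a.jpg", "a_prime.jpg", "b.jpg",
#                     ["cand0.jpg", "cand1.jpg", "cand2.jpg", "cand3.jpg"])
```

A baseline like this would be expected to do well with random distractors but, as the abstract reports for state-of-the-art models, to degrade sharply when distractors are carefully chosen.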
