Paper Title
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding
Paper Authors
Paper Abstract
From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions of what happened before, their mental/physical states, or their intentions. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, which tests a model's ability to ground individuals given context descriptions of what happened before, and of their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsense descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pre-trained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.
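To make the task concrete, here is a minimal sketch of what a Human-centric Commonsense Grounding instance and its evaluation might look like. The field names (`image_path`, `person_boxes`, `target_index`) and the `GroundingExample`/`evaluate` helpers are illustrative assumptions for this sketch, not the actual HumanCog schema, which is defined in the released dataset and code.

```python
# Hypothetical sketch of the Human-centric Commonsense Grounding task format.
# All names and fields below are assumptions, not the official HumanCog schema.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundingExample:
    image_path: str                                  # visual scene containing multiple people
    person_boxes: List[Tuple[int, int, int, int]]    # candidate person regions (x1, y1, x2, y2)
    description: str                                 # commonsense context: prior event, state, or intention
    target_index: int                                # index of the person the description refers to


def evaluate(predictions: List[int], examples: List[GroundingExample]) -> float:
    """Grounding accuracy: fraction of examples where the predicted person matches the target."""
    correct = sum(int(p == ex.target_index) for p, ex in zip(predictions, examples))
    return correct / max(len(examples), 1)


if __name__ == "__main__":
    example = GroundingExample(
        image_path="scene.jpg",
        person_boxes=[(10, 20, 110, 220), (150, 30, 260, 240), (300, 25, 400, 230)],
        description="the person who needs healing",
        target_index=1,
    )
    print(evaluate([1], [example]))  # 1.0 -- the model grounded the correct person
```

A model for this task would score each candidate person box against the description (e.g., by fusing region features with the text) and predict the index of the best match; accuracy over such predictions is one natural way to report performance on a benchmark of this form.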