Paper Title

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Paper Authors

Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, Kazuhito Koishida

Paper Abstract

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in perception. However, recent advances in this area are still primarily driven by perception improvements (e.g. scene graph generation) rather than reasoning. Neuro-symbolic models such as Neural Module Networks bring the benefits of compositional reasoning to VQA, but they are still entangled with visual representation learning, and thus neural reasoning is hard to improve and assess on its own. To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. To this end, we introduce a differentiable first-order logic formalism for VQA that explicitly decouples question answering from visual perception. On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models leading to informative insights regarding the participating models as well as the task.
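
To make the "differentiable first-order logic" idea more concrete, below is a minimal sketch of how a question such as "Is there a red object to the left of a cube?" could be scored over soft perception outputs. This is an illustration under stated assumptions (a product t-norm for conjunction and a noisy-OR for the existential quantifier), not the paper's actual formalism; all predicate names, object counts, and probabilities are made up.

```python
# Minimal sketch (illustrative assumptions, not the paper's formalism) of
# evaluating a first-order logic query differentiably over soft perception.

import numpy as np

# Hypothetical soft perception outputs for 3 detected objects:
# p_red[i]     = probability that object i is red
# p_cube[i]    = probability that object i is a cube
# p_left[i, j] = probability that object i is to the left of object j
p_red = np.array([0.9, 0.2, 0.1])
p_cube = np.array([0.1, 0.8, 0.3])
p_left = np.array([
    [0.0, 0.7, 0.9],
    [0.1, 0.0, 0.6],
    [0.0, 0.2, 0.0],
])

def t_and(a, b):
    # Product t-norm: a differentiable relaxation of logical AND.
    return a * b

def exists(probs):
    # Noisy-OR relaxation of the existential quantifier over objects
    # (a smooth max such as log-sum-exp is another common choice).
    return 1.0 - np.prod(1.0 - probs)

# Score of "exists x, y: red(x) AND cube(y) AND left_of(x, y)".
pairwise = t_and(p_red[:, None], t_and(p_cube[None, :], p_left))
answer = exists(pairwise.ravel())
print(f"P(answer = yes) ~= {answer:.3f}")
```

Because every operation above is differentiable, such a query score can be trained or calibrated end to end even when the underlying perception probabilities are imperfect, which is the setting the abstract refers to.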
