Paper Title
Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features
Paper Authors
Paper Abstract
Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods struggle to generalize to novel categories that have no corresponding neural data for training, for two main reasons: 1) the under-exploitation of the multimodal semantic knowledge underlying the neural data, and 2) the scarcity of paired (stimulus-response) training data. To overcome these limitations, this paper presents a generic neural decoding method called BraVL that uses multimodal learning of brain-visual-linguistic features. We focus on modeling the relationships between brain, visual and linguistic features via multimodal deep generative models. Specifically, we leverage the mixture-of-products-of-experts (MoPoE) formulation to infer a latent code that enables coherent joint generation of all three modalities. To learn a more consistent joint representation and improve data efficiency when brain activity data are limited, we exploit both intra- and inter-modality mutual-information maximization regularization terms. In particular, our BraVL model can be trained under various semi-supervised scenarios to incorporate the visual and textual features obtained from extra categories. Finally, we construct three trimodal matching datasets, and the extensive experiments lead to some interesting conclusions and cognitive insights: 1) decoding novel visual categories from human brain activity is practically possible with good accuracy; 2) decoding models that combine visual and linguistic features perform much better than those using either alone; 3) visual perception may be accompanied by linguistic influences that represent the semantics of visual stimuli. Code and data: https://github.com/ChangdeDu/BraVL.
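To make the mixture-of-products-of-experts (MoPoE) inference concrete, here is a minimal sketch of how diagonal-Gaussian posteriors from the three modality encoders (brain, visual, linguistic) can be fused. This is not the authors' implementation: the function names, the NumPy setting, and the uniform mixture over non-empty modality subsets are our assumptions, following the standard MoPoE construction.

```python
import itertools
import numpy as np

def product_of_experts(mus, logvars):
    """Fuse diagonal-Gaussian experts N(mu_i, var_i) into one Gaussian.

    The PoE posterior has precision equal to the sum of the experts'
    precisions and a precision-weighted mean.
    """
    precisions = [np.exp(-lv) for lv in logvars]              # 1 / var_i
    joint_precision = sum(precisions)
    joint_var = 1.0 / joint_precision
    joint_mu = joint_var * sum(p * m for p, m in zip(precisions, mus))
    return joint_mu, np.log(joint_var)

def mixture_of_products_of_experts(mus, logvars):
    """MoPoE: a uniform mixture of PoE posteriors, one per non-empty
    subset of the modalities (here: brain, visual, linguistic)."""
    components = []
    for r in range(1, len(mus) + 1):
        for subset in itertools.combinations(range(len(mus)), r):
            components.append(product_of_experts(
                [mus[i] for i in subset], [logvars[i] for i in subset]))
    return components

def sample_mopoe(mus, logvars, rng=None):
    """Draw z by picking a mixture component uniformly, then sampling
    from its Gaussian (reparameterization-style)."""
    rng = rng or np.random.default_rng()
    comps = mixture_of_products_of_experts(mus, logvars)
    mu, logvar = comps[rng.integers(len(comps))]
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

Because the mixture ranges over all modality subsets, the same model can infer the latent code from brain activity alone at test time, which is what allows decoding of categories never paired with neural data during training.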
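The abstract does not specify how the intra- and inter-modality mutual-information terms are estimated; one common choice for such regularizers is an InfoNCE-style contrastive lower bound, sketched below under that assumption (PyTorch; the function name and temperature value are hypothetical, not taken from the paper).

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE lower bound on mutual information between paired batches.

    z_a, z_b: (batch, dim) representations of the same stimuli in two
    views or modalities; matched rows are positives, all other pairs in
    the batch serve as negatives. Minimizing this loss maximizes the
    contrastive MI bound.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature      # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```

Under this reading, an intra-modality term would contrast two augmented views of the same modality's features, while an inter-modality term would contrast, e.g., brain features against the visual or linguistic features of the same stimulus.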