A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers

Paper Title

A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers

Authors

Chen, Yuzhong, Du, Yu, Xiao, Zhenxiang, Zhao, Lin, Zhang, Lu, Liu, David Weizhong, Zhu, Dajiang, Zhang, Tuo, Hu, Xintao, Liu, Tianming, Jiang, Xi

Abstract

Vision transformer (ViT) and its variants have achieved remarkable success in various visual tasks. The key characteristic of these ViT models is that they adopt different strategies for aggregating spatial patch information within the artificial neural networks (ANNs). However, there is still no unified representation of different ViT architectures that supports systematic understanding and assessment of model representation performance. Moreover, how similar those well-performing ViT ANNs are to real biological neural networks (BNNs) is largely unexplored. To answer these fundamental questions, we, for the first time, propose a unified and biologically-plausible relational graph representation of ViT models. Specifically, the proposed relational graph representation consists of two key sub-graphs: an aggregation graph and an affine graph. The former treats ViT tokens as nodes and describes their spatial interaction, while the latter treats network channels as nodes and reflects the information communication between channels. Using this unified relational graph representation, we found that: a) a sweet spot of the aggregation graph leads to ViTs with significantly improved predictive performance; b) the graph measures of clustering coefficient and average path length are two effective indicators of model prediction performance, especially when applied to datasets with small samples; c) our findings are consistent across various ViT architectures and multiple datasets; d) the proposed relational graph representation of ViT has high similarity with real BNNs derived from brain science data. Overall, our work provides a novel unified and biologically-plausible paradigm for more interpretable and effective representation of ViT ANNs.
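The two graph measures named in the abstract, clustering coefficient and average path length, can be illustrated with a minimal sketch. The abstract does not describe how the aggregation graph is built, so the ring lattice below (each token linked to its `k` nearest neighbors on each side) is only a stand-in assumption, and the function names `ring_lattice`, `average_clustering`, and `average_path_length` are hypothetical, not taken from the paper.

```python
from collections import deque


def ring_lattice(n, k):
    """Undirected graph on n nodes; each node links to k neighbors on each side.

    Stand-in for a token graph (an assumption, not the paper's construction).
    """
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean over nodes of (links among neighbors) / (possible links among neighbors)."""
    total = 0.0
    for nbrs in adj.values():
        deg = len(nbrs)
        if deg < 2:
            continue  # local coefficient is taken as 0 for degree < 2
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (deg * (deg - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    n = len(adj)
    total = 0
    for src in adj:
        # Breadth-first search gives shortest hop counts in an unweighted graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != n:
            raise ValueError("graph is not connected")
        total += sum(dist.values())
    return total / (n * (n - 1))


if __name__ == "__main__":
    graph = ring_lattice(16, 2)
    print(average_clustering(graph), average_path_length(graph))
```

In this toy setting, widening the neighborhood `k` shortens the average path length, which gives a rough intuition for how different aggregation strategies could shift the two measures.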

Paper Title