Title
Positional Label for Self-Supervised Vision Transformer
Authors
Abstract
Positional encoding is important for vision transformers (ViTs) to capture the spatial structure of the input image, and its general effectiveness has been demonstrated in ViT. In this work, we propose to train a ViT to recognize the positional labels of the patches of the input image; this apparently simple task in fact yields a meaningful self-supervisory signal. Building on previous work on ViT positional encoding, we propose two positional labels dedicated to 2D images: absolute position and relative position. Our positional labels can be easily plugged into various current ViT variants and can work in two ways: (a) as an auxiliary training target for vanilla ViTs (e.g., ViT-B and Swin-B) for better performance; (b) combined with self-supervised ViTs (e.g., MAE) to provide a stronger self-supervision signal for semantic feature learning. Experiments demonstrate that with the proposed self-supervised methods, ViT-B and Swin-B gain improvements of 1.20% and 0.74% (top-1 accuracy) on ImageNet, respectively, and of 6.15% and 1.14% on Mini-ImageNet.
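To make the two label types concrete, the following is a minimal sketch of how absolute and relative positional labels could be constructed for a 2D grid of image patches. The function names and the exact label encodings (row-major absolute indices; relative labels as quantized 2D offsets mapped to class ids) are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def absolute_labels(h, w):
    # Absolute label: each patch in an h x w grid gets a unique
    # index in row-major order, which the model learns to predict.
    return np.arange(h * w).reshape(h, w)

def relative_labels(h, w):
    # Relative label for each ordered patch pair (i, j): the 2D
    # offset (dy, dx) between their grid coordinates, mapped to a
    # single class id. Offsets span [-(h-1), h-1] x [-(w-1), w-1],
    # giving (2h - 1) * (2w - 1) possible classes.
    ys, xs = np.divmod(np.arange(h * w), w)
    dy = ys[:, None] - ys[None, :] + (h - 1)  # shift to non-negative
    dx = xs[:, None] - xs[None, :] + (w - 1)
    return dy * (2 * w - 1) + dx
```

Either label set can then serve as the target of an auxiliary classification head on the patch tokens, trained with cross-entropy alongside the main objective.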