Paper Title
Optimizing Relevance Maps of Vision Transformers Improves Robustness
Paper Authors
Paper Abstract
It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and manipulate it such that the model focuses on the foreground object. This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks. Specifically, we encourage the model's relevancy map (i) to assign lower relevance to background regions, (ii) to consider as much information as possible from the foreground, and (iii) to produce decisions with high confidence. When applied to Vision Transformer (ViT) models, a marked improvement in robustness to domain shifts is observed. Moreover, the foreground masks can be obtained automatically from a self-supervised variant of the ViT model itself; therefore, no additional supervision is required.
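The abstract describes three objectives imposed on the model's relevancy map during finetuning. A minimal sketch of how such terms might be combined is shown below; the function name, the specific loss formulations (summed background relevance, negative log of foreground mass, prediction entropy), and the equal weighting are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def relevance_losses(relevance, fg_mask, logits):
    """Sketch of the three objectives from the abstract (assumed formulation).

    relevance: non-negative per-patch relevancy scores, shape (N,)
    fg_mask:   binary foreground mask over patches, shape (N,), 1 = foreground
    logits:    the classifier's output logits, shape (C,)
    """
    # Normalize the relevancy map into a distribution over patches.
    rel = relevance / (relevance.sum() + 1e-8)
    # (i) Penalize relevance mass assigned to background patches.
    bg_loss = rel[fg_mask == 0].sum()
    # (ii) Encourage relevance mass to concentrate on the foreground.
    fg_loss = -np.log(rel[fg_mask == 1].sum() + 1e-8)
    # (iii) Encourage confident decisions via low prediction entropy.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    conf_loss = -(probs * np.log(probs + 1e-8)).sum()
    # Equal weighting is an arbitrary choice for this sketch.
    return bg_loss + fg_loss + conf_loss
```

In a real finetuning loop these terms would be computed from the ViT's relevancy maps (differentiably) and added to the classification loss; here they are shown with NumPy purely to make the objectives concrete.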