Paper Title

Studying inductive biases in image classification task

Paper Authors

Arizumi, Nana

Paper Abstract

Recently, self-attention (SA) structures have become popular in computer vision. They have locally independent filters and can use large kernels, in contrast to the previously popular convolutional neural networks (CNNs). The success of CNNs was attributed to the hard-coded inductive biases of locality and spatial invariance. However, recent studies have shown that the inductive biases in CNNs are too restrictive. On the other hand, relative position encodings, which resemble depthwise (DW) convolution, are necessary for local SA networks, indicating that SA structures are not entirely spatially variant. Hence, we want to determine which part of these inductive biases contributes to the success of local SA structures. To do so, we introduce context-aware decomposed attention (CADA), which decomposes attention maps into multiple trainable base kernels and accumulates them using context-aware (CA) parameters. In this way, we can identify the link between CNNs and SA networks. We conducted ablation studies using ResNet50 applied to the ImageNet classification task. Compared with standard convolution in CNNs, DW convolution can use a large kernel (greater locality) without increasing computational cost, but accuracy saturates with larger kernels; CADA follows this characteristic of locality. We show that context awareness is the crucial property, yet large local information is not necessary to construct the CA parameters. Although removing spatial invariance entirely makes training difficult, relaxed spatial invariance gives better accuracy than strict spatial invariance. Additional strong spatial invariance through relative position encoding is also preferable. We extend these experiments to the filters used for downsampling and show that the locality bias is more critical for downsampling, but the strong locality bias can be removed by using relaxed spatial invariance.
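
The following is a minimal PyTorch sketch of the decomposition the abstract describes: each spatial position's attention map is built as a mixture of shared trainable base kernels, with the mixing coefficients (the CA parameters) predicted from that position's feature. The class name CADA2d, the 1x1 convolution used to produce the coefficients, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CADA2d(nn.Module):
    """Sketch: per-position attention maps as CA-weighted sums of base kernels."""
    def __init__(self, channels, num_bases=8, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        # Trainable base kernels shared across all spatial positions
        # (the spatially invariant part, analogous to DW-convolution filters).
        self.bases = nn.Parameter(torch.randn(num_bases, kernel_size * kernel_size))
        # 1x1 conv predicting the context-aware (CA) mixing coefficients
        # from each position's feature (the spatially variant part).
        self.to_coeff = nn.Conv2d(channels, num_bases, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        coeff = self.to_coeff(x).softmax(dim=1)                  # (B, K, H, W)
        # Per-position kernel = coefficient-weighted sum of the base kernels.
        attn = torch.einsum('bkhw,kn->bnhw', coeff,
                            self.bases.softmax(dim=-1))          # (B, k*k, H, W)
        # Gather each position's k x k neighborhood, per channel.
        patches = F.unfold(x, self.k, padding=self.k // 2)       # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        attn = attn.reshape(b, 1, self.k * self.k, h * w)
        # Apply each position's filter to its own neighborhood (depthwise).
        return (patches * attn).sum(dim=2).view(b, c, h, w)

For example, out = CADA2d(channels=64)(torch.randn(1, 64, 56, 56)) preserves the input shape. Setting num_bases=1 makes the softmaxed coefficients constant, so the layer degenerates to a single spatially invariant DW-style kernel, which is one way to read the CNN-SA link the abstract mentions.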
