Title
Infinite attention: NNGP and NTK for deep attention networks
Authors
Abstract
There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.
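The sketch below illustrates, under stated assumptions, the kind of NNGP/NTK workflow with the Neural Tangents library that the abstract refers to: closed-form kernels of an infinitely wide network and exact Bayesian (NNGP) / gradient-descent (NTK) predictions without ever training a finite network. The architecture, data, and hyperparameters are purely illustrative and are not taken from the paper; the attention layers discussed in the paper are exposed through `stax.GlobalSelfAttention`, whose exact keyword arguments should be checked against the library documentation.

```python
# Minimal sketch (not the authors' code) of NNGP/NTK inference with Neural Tangents.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Illustrative fully-connected architecture; stax.GlobalSelfAttention can be
# inserted into such a stax.serial stack to obtain the attention kernels
# studied in the paper (arguments omitted here - see the library docs).
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

# Toy data: 20 training and 5 test examples with 32 features each (hypothetical).
key = random.PRNGKey(0)
x_train = random.normal(key, (20, 32))
y_train = random.normal(key, (20, 1))
x_test = random.normal(key, (5, 32))

# Closed-form NNGP and NTK kernels of the infinitely wide network.
kernels = kernel_fn(x_train, x_train, ('nngp', 'ntk'))

# Predictions of the wide-network ensemble: Bayesian posterior mean (NNGP)
# and the infinite-time gradient-descent solution (NTK), with no optimiser run.
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_test_nngp, y_test_ntk = predict_fn(x_test=x_test, get=('nngp', 'ntk'))
```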