Paper title
Is SGD a Bayesian sampler? Well, almost
Paper authors
Paper abstract
Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability $P_{SGD}(f\mid S)$ that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function $f$ consistent with a training set $S$. We also use Gaussian processes to estimate the Bayesian posterior probability $P_B(f\mid S)$ that the DNN expresses $f$ upon random sampling of its parameters, conditioned on $S$. Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_B(f\mid S)$ and that $P_B(f\mid S)$ is strongly biased towards low-error and low-complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines $P_B(f\mid S)$), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior $P_B(f\mid S)$ is the first-order determinant of $P_{SGD}(f\mid S)$, there remain second-order differences that are sensitive to hyperparameter tuning. A function probability picture, based on $P_{SGD}(f\mid S)$ and/or $P_B(f\mid S)$, can shed new light on the way that variations in architecture or hyperparameter settings, such as batch size, learning rate, and optimiser choice, affect DNN performance.
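To make the quantity $P_{SGD}(f\mid S)$ concrete, the sketch below (not the authors' code; it assumes PyTorch, a toy Boolean-input task, and illustrative hyperparameters) repeatedly trains the same overparameterised network with minibatch SGD from independent random initialisations until it fits $S$ with zero classification error, identifies the learned function $f$ by its labels on a fixed test set, and tallies how often each $f$ occurs.

```python
# Hedged sketch: empirically estimating P_SGD(f|S) by repeated SGD training runs.
# Dataset, architecture, and hyperparameters are illustrative assumptions only.
import collections
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_train, n_test, n_runs, batch_size = 7, 32, 32, 200, 8

# Toy binary task on Boolean inputs (a stand-in for the paper's datasets).
X = (torch.rand(n_train + n_test, n_in) > 0.5).float()
y = (X.sum(dim=1) > n_in / 2).float()
X_train, y_train, X_test = X[:n_train], y[:n_train], X[n_train:]

counts = collections.Counter()
for run in range(n_runs):
    net = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for step in range(50000):
        idx = torch.randint(0, n_train, (batch_size,))
        opt.zero_grad()
        loss = loss_fn(net(X_train[idx]).squeeze(1), y_train[idx])
        loss.backward()
        opt.step()
        # Stop once the network fits the training set S with zero error.
        with torch.no_grad():
            if ((net(X_train).squeeze(1) > 0).float() == y_train).all():
                break
    # Identify the function f by its predicted labels on the held-out inputs.
    with torch.no_grad():
        f = tuple((net(X_test).squeeze(1) > 0).int().tolist())
    counts[f] += 1

# Empirical estimate of P_SGD(f|S) for the most frequently found functions.
for f, c in counts.most_common(5):
    print(c / n_runs, f)
```

The Bayesian posterior $P_B(f\mid S)$ could be tallied in the same way by drawing random parameter samples and keeping only those that fit $S$; since such rejection sampling is impractical for realistic networks, the paper instead approximates $P_B(f\mid S)$ with the corresponding Gaussian process, and the reported correlation comes from comparing the two probabilities function by function.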