Paper title
Is SGD a Bayesian sampler? Well, almost
Paper authors
Paper abstract
Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability $P_{SGD}(f\mid S)$ that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function $f$ consistent with a training set $S$. We also use Gaussian processes to estimate the Bayesian posterior probability $P_B(f\mid S)$ that the DNN expresses $f$ upon random sampling of its parameters, conditioned on $S$. Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_B(f\mid S)$ and that $P_B(f\mid S)$ is strongly biased towards low-error and low-complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines $P_B(f\mid S)$), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior $P_B(f\mid S)$ is the first-order determinant of $P_{SGD}(f\mid S)$, there remain second-order differences that are sensitive to hyperparameter tuning. A function probability picture, based on $P_{SGD}(f\mid S)$ and/or $P_B(f\mid S)$, can shed new light on the way that variations in architecture or hyperparameter settings, such as batch size, learning rate, and optimiser choice, affect DNN performance.
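To make the quantity $P_{SGD}(f\mid S)$ concrete, the sketch below (not the authors' code; it assumes PyTorch, a toy Boolean-input task, and illustrative hyperparameters) repeatedly trains the same overparameterised network with minibatch SGD from independent random initialisations until it fits $S$ with zero classification error, identifies the learned function $f$ by its labels on a fixed test set, and tallies how often each $f$ occurs.

```python
# Hedged sketch: empirically estimating P_SGD(f|S) by repeated SGD training runs.
# Dataset, architecture, and hyperparameters are illustrative assumptions only.
import collections
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_train, n_test, n_runs, batch_size = 7, 32, 32, 200, 8

# Toy binary task on Boolean inputs (a stand-in for the paper's datasets).
X = (torch.rand(n_train + n_test, n_in) > 0.5).float()
y = (X.sum(dim=1) > n_in / 2).float()
X_train, y_train, X_test = X[:n_train], y[:n_train], X[n_train:]

counts = collections.Counter()
for run in range(n_runs):
    net = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for step in range(50000):
        idx = torch.randint(0, n_train, (batch_size,))
        opt.zero_grad()
        loss = loss_fn(net(X_train[idx]).squeeze(1), y_train[idx])
        loss.backward()
        opt.step()
        # Stop once the network fits the training set S with zero error.
        with torch.no_grad():
            if ((net(X_train).squeeze(1) > 0).float() == y_train).all():
                break
    # Identify the function f by its predicted labels on the held-out inputs.
    with torch.no_grad():
        f = tuple((net(X_test).squeeze(1) > 0).int().tolist())
    counts[f] += 1

# Empirical estimate of P_SGD(f|S) for the most frequently found functions.
for f, c in counts.most_common(5):
    print(c / n_runs, f)
```

The Bayesian posterior $P_B(f\mid S)$ could be tallied in the same way by drawing random parameter samples and keeping only those that fit $S$; since such rejection sampling is impractical for realistic networks, the paper instead approximates $P_B(f\mid S)$ with the corresponding Gaussian process, and the reported correlation comes from comparing the two probabilities function by function.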