Paper Title
Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks
Paper Authors
Paper Abstract
Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems remains an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDEs) under heavy-tailed gradient noise has recently shed light on several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning-theoretic framework is still missing. Aiming to bridge this gap, in this paper we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a \emph{Feller process}, which defines a rich class of Markov processes that includes several recent SDE representations (both Brownian and heavy-tailed) as special cases. We show that the generalization error can be controlled by the \emph{Hausdorff dimension} of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of "capacity metric". We support our theory with experiments on deep neural networks, illustrating that the proposed capacity metric accurately estimates the generalization error and, unlike existing capacity metrics in the literature, does not necessarily grow with the number of parameters.
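The abstract proposes the tail-index of the driving process as a capacity metric but does not spell out how one would compute it. As a minimal illustrative sketch (not the authors' own procedure), the Python snippet below implements the moment-based tail-index estimator of Mohammadi et al. (2015) for symmetric alpha-stable samples, an estimator that has been used in related work on SGD gradient noise; the function name `estimate_tail_index`, the block size `m`, and the idea of feeding it centered minibatch-gradient-noise samples are assumptions for illustration only.

```python
import numpy as np

def estimate_tail_index(noise, m=64):
    """Estimate the tail-index alpha of samples assumed to follow a
    symmetric alpha-stable law, via the moment-based estimator of
    Mohammadi et al. (2015). Smaller alpha means heavier tails.

    noise : 1-D array of (centered) stochastic gradient noise samples.
    m     : block size; the sample is split into n = len(noise) // m blocks.
    """
    noise = np.asarray(noise, dtype=float)
    n = len(noise) // m
    x = noise[: n * m]
    # Block sums Y_i over m consecutive samples: for an alpha-stable law,
    # each Y_i is again alpha-stable with scale inflated by m**(1/alpha),
    # so E[log|Y|] - E[log|X|] = (1/alpha) * log(m).
    y = x.reshape(n, m).sum(axis=1)
    eps = 1e-12  # guard against log(0)
    inv_alpha = (np.log(np.abs(y) + eps).mean()
                 - np.log(np.abs(x) + eps).mean()) / np.log(m)
    return 1.0 / inv_alpha

# Sanity check on Gaussian noise (alpha = 2): the estimate should be near 2.
rng = np.random.default_rng(0)
print(estimate_tail_index(rng.standard_normal(2 ** 16)))  # ~2.0
```

In practice one would collect the noise samples at a fixed iterate, e.g. as coordinates of the difference between a minibatch gradient and the full-batch gradient, and a smaller estimated alpha (heavier tails) would, per the paper's thesis, suggest better generalization.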