Title

When Does Preconditioning Help or Hurt Generalization?

Authors

Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, Ji Xu

Abstract

While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the \textit{implicit bias} of first- and second-order methods affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioner $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better than NGD under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.
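The setting the abstract describes can be illustrated with a minimal numpy sketch. Assuming an overparameterized linear model with diagonal feature covariance (the dimensions, noise level, and spectrum here are illustrative choices, not the paper's exact experimental setup), preconditioned GD from zero initialization converges to the interpolant theta = P X^T (X P X^T)^{-1} y, i.e. the minimum-||theta||_{P^{-1}} solution; taking P = I recovers plain GD, and P = Sigma^{-1} plays the role of the inverse population Fisher used by NGD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200  # overparameterized regime: d > n (illustrative sizes)

# Anisotropic Gaussian features with diagonal covariance Sigma
eigs = 1.0 / np.arange(1, d + 1)
Sigma = np.diag(eigs)
X = rng.standard_normal((n, d)) * np.sqrt(eigs)  # rows x_i ~ N(0, Sigma)

theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
noise = 0.5  # label noise level (assumed for illustration)
y = X @ theta_star + noise * rng.standard_normal(n)

def interpolant(P):
    """Limit of preconditioned GD from zero init: the interpolating
    solution theta = P X^T (X P X^T)^{-1} y, which minimizes
    ||theta||_{P^{-1}} subject to X theta = y."""
    G = X @ P @ X.T
    return P @ X.T @ np.linalg.solve(G, y)

def risk(theta):
    """Population risk E[(x^T (theta - theta_star))^2] under covariance Sigma."""
    diff = theta - theta_star
    return diff @ Sigma @ diff

theta_gd = interpolant(np.eye(d))               # GD: min-l2-norm interpolant
theta_ngd = interpolant(np.linalg.inv(Sigma))   # NGD-style: P = Sigma^{-1}
print(risk(theta_gd), risk(theta_ngd))
```

Both solutions interpolate the training data exactly; which one attains lower population risk depends on the noise level and on how the signal aligns with the feature spectrum, which is the tradeoff the paper characterizes.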
