Paper Title
Learning to Optimize Quasi-Newton Methods
Paper Authors
Paper Abstract
Fast gradient-based optimization algorithms have become increasingly essential for the computationally efficient training of machine learning models. One technique is to multiply the gradient by a preconditioner matrix to produce a step, but it is unclear what the best preconditioner matrix is. This paper introduces a novel machine learning optimizer called LODO, which attempts to meta-learn the best preconditioner online during optimization. Specifically, our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn preconditioners parameterized as neural networks; these are more flexible than the preconditioners used in other quasi-Newton methods. Unlike other L2O methods, LODO does not require any meta-training on a distribution of training tasks; instead, it learns to optimize on the fly while optimizing the test task, adapting to the local characteristics of the loss landscape as it traverses it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify that our algorithm can optimize in noisy settings, and show that simpler alternatives for representing the inverse Hessian worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters at speeds comparable to those of standard neural network optimizers.
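As a rough illustration of the idea the abstract describes (not the authors' LODO implementation), the sketch below takes preconditioned gradient steps theta <- theta - lr * P @ g while meta-learning the preconditioner P online from a hypergradient. Everything here is an assumption made for illustration: the function names, the toy noisy quadratic loss, and the choice of a plain dense matrix for P; LODO instead parameterizes the preconditioner as a neural network.

```python
# Minimal sketch, assuming a dense-matrix preconditioner and a toy quadratic loss.
import numpy as np

def noisy_quadratic_grad(theta, H, rng, noise=0.01):
    # Gradient of the toy loss 0.5 * theta^T H theta, with Gaussian noise added
    # to mimic the stochastic gradients discussed in the abstract.
    return H @ theta + noise * rng.standard_normal(theta.shape)

def online_preconditioned_descent(H, steps=500, lr=0.1, meta_lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    theta = rng.standard_normal(d)
    P = np.eye(d)          # learned preconditioner, initialized to the identity
    g_prev = None
    for _ in range(steps):
        g = noisy_quadratic_grad(theta, H, rng)
        if g_prev is not None:
            # The previous step was theta <- theta - lr * P @ g_prev, so the
            # hypergradient of the current loss with respect to P is
            # -lr * outer(g, g_prev); descending it adapts P online.
            P -= meta_lr * (-lr * np.outer(g, g_prev))
        theta = theta - lr * (P @ g)   # preconditioned gradient step
        g_prev = g
    return theta, P
```

On an ill-conditioned quadratic, e.g. H = np.diag([1.0, 10.0, 100.0]), the meta-update tends to adjust P so that the effective step lr * P @ g behaves more like a Newton step, loosely mirroring the abstract's claim that the learned preconditioner approximates the inverse Hessian.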