Paper Title
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
Paper Authors
Paper Abstract
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.
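To make the components listed in the abstract concrete, the following is a minimal PyTorch sketch of the core idea: a Hutchinson estimate of the Hessian diagonal via Hessian-vector products, followed by an Adam-style update that replaces the squared gradient with an exponential moving average of the squared Hessian diagonal. The function names (hutchinson_hessian_diag, adahessian_like_step) and hyperparameter values are illustrative assumptions, and the sketch omits the block-diagonal (spatial) averaging and the Hessian power hyperparameter of the full ADAHESSIAN algorithm; it is not the authors' released implementation.

import torch

def hutchinson_hessian_diag(loss, params, n_samples=1):
    # Hutchinson's estimator: diag(H) ~= E_z[z * (H z)] with Rademacher z in {-1, +1}.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, high=2) * 2 - 1 for p in params]
        # Hessian-vector product H z via a second backward pass through the gradients.
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hv in zip(diag, zs, hvps):
            d += z * hv / n_samples
    return [g.detach() for g in grads], diag

def adahessian_like_step(params, grads, hess_diag, state, step,
                         lr=0.1, betas=(0.9, 0.999), eps=1e-8):
    # Adam-style update using the Hessian diagonal in place of the squared gradient.
    b1, b2 = betas
    for i, (p, g, d) in enumerate(zip(params, grads, hess_diag)):
        m, v = state.get(i, (torch.zeros_like(p), torch.zeros_like(p)))
        m = b1 * m + (1 - b1) * g             # EMA of the gradient (first moment)
        v = b2 * v + (1 - b2) * d.abs() ** 2  # root-mean-square style EMA of the Hessian diagonal
        state[i] = (m, v)
        m_hat = m / (1 - b1 ** step)          # bias correction, as in Adam
        v_hat = v / (1 - b2 ** step)
        with torch.no_grad():
            p -= lr * m_hat / (v_hat.sqrt() + eps)

# Toy usage: fit a small linear least-squares problem.
torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)
x, y = torch.randn(64, 5), torch.randn(64)
state = {}
for t in range(1, 21):
    loss = ((x @ w - y) ** 2).mean()
    grads, hdiag = hutchinson_hessian_diag(loss, [w])
    adahessian_like_step([w], grads, hdiag, state, step=t)
    print(f"step {t:2d}  loss {loss.item():.4f}")

The second backward pass for the Hessian-vector product is what keeps the per-iteration cost within a small constant factor of a first-order method, which is the property the abstract highlights.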