Preconditioned Stochastic Gradient Descent

Cited by: 58
Authors
Li, Xi-Lin [1 ,2 ,3 ]
Affiliations
[1] Univ Maryland Baltimore Cty, Machine Learning Signal Proc Lab, Baltimore, MD 21228 USA
[2] Fortemedia Inc, Santa Clara, CA USA
[3] Cisco Syst Inc, San Jose, CA USA
Keywords
Neural network; Newton method; nonconvex optimization; preconditioner; stochastic gradient descent (SGD);
DOI
10.1109/TNNLS.2017.2672978
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably, but many attempts in this direction either aim at solving specialized problems or result in methods significantly more complicated than SGD. This paper proposes a new method to adaptively estimate a preconditioner such that the amplitudes of perturbations of the preconditioned stochastic gradient match those of the perturbations of the parameters to be optimized, in a way comparable to the Newton method for deterministic optimization. Unlike preconditioners based on secant-equation fitting, as used in deterministic quasi-Newton methods, which assume a positive-definite Hessian and approximate its inverse, the new preconditioner works equally well for both convex and nonconvex optimization with exact or noisy gradients. When stochastic gradients are used, it naturally damps the gradient noise to stabilize SGD. Efficient preconditioner estimation methods are developed and, with reasonable simplifications, they are applicable to large-scale problems. Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems, such as training a deep neural network or a recurrent neural network that requires extremely long-term memory.
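The abstract describes fitting a preconditioner so that the amplitudes of preconditioned-gradient perturbations match those of the parameter perturbations. The minimal NumPy sketch below illustrates that fitting idea on a toy quadratic, assuming a factorization P = Q^T Q with upper-triangular Q; the helper names update_precond and psgd_step, the step sizes, the probing scheme, and the toy objective are illustrative choices for this sketch, not necessarily the paper's exact algorithm.

import numpy as np

def update_precond(Q, dx, dg, step=0.01):
    """One descent step on the criterion dg^T P dg + dx^T P^{-1} dx with P = Q^T Q,
    which balances the amplitude of the preconditioned gradient perturbation dg
    against the parameter perturbation dx (cf. the abstract)."""
    a = Q @ dg                                         # Q dg
    b = np.linalg.solve(Q.T, dx)                       # Q^{-T} dx
    grad = np.triu(np.outer(a, a) - np.outer(b, b))    # descent direction for upper-triangular Q
    mu = step / (np.abs(grad).max() + 1e-12)           # normalized step size
    return Q - mu * grad @ Q

def psgd_step(theta, Q, grad_fn, lr=0.1, eps=1e-6):
    """One preconditioned SGD step: probe with a small perturbation, refit Q, then move."""
    g = grad_fn(theta)
    dx = eps * np.random.randn(theta.size)             # parameter perturbation
    dg = grad_fn(theta + dx) - g                       # induced gradient perturbation
    Q = update_precond(Q, dx, dg)
    return theta - lr * (Q.T @ Q) @ g, Q               # move along the preconditioned gradient

# Toy usage: a mildly ill-conditioned quadratic 0.5 * theta^T H theta.
H = np.diag([1.0, 10.0])
grad_fn = lambda th: H @ th
theta, Q = np.array([1.0, 1.0]), np.eye(2)
for _ in range(200):
    theta, Q = psgd_step(theta, Q, grad_fn)
print(theta)      # approaches the minimizer [0, 0]
print(Q.T @ Q)    # approaches H^{-1} = diag(1, 0.1), i.e., a Newton-like preconditioner

On this deterministic quadratic the fitted preconditioner approaches the inverse Hessian, which is the Newton-like behavior the abstract refers to; with noisy gradients the same criterion also damps gradient noise rather than amplifying it.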
Pages: 1454 - 1466
Page count: 13
Related papers
50 records
  • [31] Stochastic Gradient Descent on Riemannian Manifolds
    Bonnabel, Silvere
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2013, 58 (09) : 2217 - 2229
  • [32] Conjugate directions for stochastic gradient descent
    Schraudolph, NN
    Graepel, T
    ARTIFICIAL NEURAL NETWORKS - ICANN 2002, 2002, 2415 : 1351 - 1356
  • [33] STOCHASTIC MODIFIED FLOWS FOR RIEMANNIAN STOCHASTIC GRADIENT DESCENT
    Gess, Benjamin
    Kassing, Sebastian
    Rana, Nimit
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2024, 62 (06) : 3288 - 3314
  • [34] A Stochastic Gradient Descent Approach for Stochastic Optimal Control
    Archibald, Richard
    Bao, Feng
    Yong, Jiongmin
    EAST ASIAN JOURNAL ON APPLIED MATHEMATICS, 2020, 10 (04) : 635 - 658
  • [35] Stochastic modified equations for the asynchronous stochastic gradient descent
    An, Jing
    Lu, Jianfeng
    Ying, Lexing
    INFORMATION AND INFERENCE-A JOURNAL OF THE IMA, 2020, 9 (04) : 851 - 873
  • [36] Preconditioned Gradient Descent for Over-Parameterized Nonconvex Matrix Factorization
    Zhang, Gavin
    Fattahi, Salar
    Zhang, Richard Y.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [37] Transformers learn to implement preconditioned gradient descent for in-context learning
    Ahn, Kwangjun
    Cheng, Xiang
    Daneshmand, Hadi
    Sra, Suvrit
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [38] Preconditioned Gradient Descent Algorithm for Inverse Filtering on Spatially Distributed Networks
    Cheng, Cheng
    Emirov, Nazar
    Sun, Qiyu
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 1834 - 1838
  • [39] Accelerating the Iteratively Preconditioned Gradient-Descent Algorithm using Momentum
    Liu, Tianchen
    Chakrabarti, Kushal
    Chopra, Nikhil
    2023 NINTH INDIAN CONTROL CONFERENCE, ICC, 2023, : 68 - 73
  • [40] On the convergence and improvement of stochastic normalized gradient descent
    Zhao, Shen-Yi
    Xie, Yin-Peng
    Li, Wu-Jun
    SCIENCE CHINA-INFORMATION SCIENCES, 2021, 64 (03) : 105 - 117