Preconditioned Stochastic Gradient Descent

Cited by: 58
Authors
Li, Xi-Lin [1,2,3]
Affiliations
[1] Univ Maryland Baltimore Cty, Machine Learning Signal Proc Lab, Baltimore, MD 21228 USA
[2] Fortemedia Inc, Santa Clara, CA USA
[3] Cisco Syst Inc, San Jose, CA USA
Keywords
Neural network; Newton method; nonconvex optimization; preconditioner; stochastic gradient descent (SGD);
DOI
10.1109/TNNLS.2017.2672978
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably, but many attempts in this direction either aim at solving specialized problems or result in methods significantly more complicated than SGD. This paper proposes a new method to adaptively estimate a preconditioner such that the amplitudes of the perturbations of the preconditioned stochastic gradient match those of the perturbations of the parameters to be optimized, in a way comparable to the Newton method for deterministic optimization. Unlike preconditioners based on secant equation fitting, as in deterministic quasi-Newton methods, which assume a positive definite Hessian and approximate its inverse, the new preconditioner works equally well for both convex and nonconvex optimization with exact or noisy gradients. When a stochastic gradient is used, it naturally damps the gradient noise and thus stabilizes SGD. Efficient preconditioner estimation methods are developed, and with reasonable simplifications they are applicable to large-scale problems. Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems, such as training a deep neural network or a recurrent neural network that requires extremely long-term memories.
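As a rough illustration of the idea described in the abstract, the sketch below adapts a purely diagonal preconditioner from random parameter perturbations and the gradient changes they cause, matching their amplitudes per coordinate, and uses it to scale SGD steps on a toy ill-conditioned quadratic. This is a minimal sketch only: the diagonal form, the running-average estimator, the toy objective, and all names and constants (stochastic_grad, lr, beta, the warm-up length) are illustrative assumptions, not the paper's estimation algorithm, which fits more general preconditioners.

```python
# Minimal sketch of amplitude-matched preconditioning (illustrative, not the
# paper's exact algorithm): a diagonal preconditioner p is adapted so that the
# per-coordinate amplitudes of p * dg match those of dtheta, where dtheta is a
# small random parameter perturbation and dg is the change it causes in the
# stochastic gradient. The toy objective and all constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim = 10
H = np.diag(np.logspace(0, 3, dim))   # ill-conditioned quadratic: loss = 0.5 * theta' H theta

def stochastic_grad(theta):
    # exact gradient plus additive noise, mimicking a minibatch gradient
    return H @ theta + 0.01 * rng.standard_normal(dim)

theta = rng.standard_normal(dim)
avg_dtheta2 = np.zeros(dim)           # running 2nd moment of parameter perturbations
avg_dg2 = np.zeros(dim)               # running 2nd moment of gradient perturbations
lr, beta, eps = 0.5, 0.99, 1e-12

for step in range(2000):
    g = stochastic_grad(theta)

    # probe: perturb the parameters slightly and observe the gradient change
    dtheta = 1e-2 * rng.standard_normal(dim)
    dg = stochastic_grad(theta + dtheta) - g

    # track second moments; each diagonal entry is chosen so that
    # |p_i * dg_i| and |dtheta_i| have matching average amplitudes
    avg_dtheta2 = beta * avg_dtheta2 + (1 - beta) * dtheta**2
    avg_dg2 = beta * avg_dg2 + (1 - beta) * dg**2
    p = np.sqrt(avg_dtheta2 / (avg_dg2 + eps))

    if step >= 20:                    # short warm-up lets the estimates settle
        theta -= lr * p * g           # preconditioned SGD step

print("final loss:", 0.5 * theta @ H @ theta)
```

For this toy diagonal case, with exact gradients the rule reduces to p_i ≈ 1/H_ii, i.e., a Newton-like step per coordinate; with noisy gradients the measured dg is inflated by the noise, so p shrinks and the step is damped, which is consistent with the noise-damping behavior mentioned in the abstract.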
Pages: 1454-1466
Page count: 13
Related Papers
50 items in total
  • [41] Stochastic Gradient Descent with Finite Samples Sizes. Yuan, Kun; Ying, Bicheng; Vlaski, Stefan; Sayed, Ali H. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
  • [42] Predicting Throughput of Distributed Stochastic Gradient Descent. Li, Zhuojin; Paolieri, Marco; Golubchik, Leana; Lin, Sung-Han; Yan, Wumo. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(11): 2900-2912.
  • [43] Stochastic Multiple Target Sampling Gradient Descent. Phan, Hoang; Tran, Ngoc N.; Le, Trung; Tran, Toan; Ho, Nhat; Phung, Dinh. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  • [44] Convergent Stochastic Almost Natural Gradient Descent. Sanchez-Lopez, Borja; Cerquides, Jesus. Artificial Intelligence Research and Development, 2019, 319: 54-63.
  • [45] Linear Convergence of Adaptive Stochastic Gradient Descent. Xie, Yuege; Wu, Xiaoxia; Ward, Rachel. arXiv, 2019.
  • [46] Revisiting the Noise Model of Stochastic Gradient Descent. Battash, Barak; Wolf, Lior; Lindenbaum, Ofir. International Conference on Artificial Intelligence and Statistics, Vol. 238, 2024.
  • [47] Stochastic Gradient Descent as Approximate Bayesian Inference. Mandt, Stephan; Hoffman, Matthew D.; Blei, David M. Journal of Machine Learning Research, 2017, 18.
  • [48] Stochastic gradient descent with differentially private updates. Song, Shuang; Chaudhuri, Kamalika; Sarwate, Anand D. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013: 245-248.
  • [49] Variance Reduced Stochastic Gradient Descent with Neighbors. Hofmann, Thomas; Lucchi, Aurelien; Lacoste-Julien, Simon; McWilliams, Brian. Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015, 28.
  • [50] Local gain adaptation in stochastic gradient descent. Schraudolph, N. N. Ninth International Conference on Artificial Neural Networks (ICANN99), Vols. 1 and 2, 1999, (470): 569-574.