First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Times Cited: 0
Authors
Nguyen, Thanh Huy [1]
Simsekli, Umut [1,2]
Gurbuzbalaban, Mert [3]
Richard, Gael [1]
Affiliations
[1] Telecom Paris, Inst Polytech Paris, LTCI, Paris, France
[2] Univ Oxford, Dept Stat, Oxford, England
[3] Rutgers Business Sch, Dept Management Sci & Informat Syst, New Brunswick, NJ USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords
SDES DRIVEN; DIFFERENTIAL-EQUATIONS; LEVY; UNIQUENESS;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using alpha-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Levy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of 'preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding of how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.
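As a rough, hypothetical illustration of the setup described in the abstract (not the paper's actual experiments or results), the sketch below runs SGD on a one-dimensional quadratic loss with symmetric alpha-stable gradient noise and estimates the mean number of iterations until the iterate first exits a fixed interval around the minimum, for two step-sizes. The function names (sas_noise, mean_exit_iterations) and all parameter values (alpha = 1.7, sigma = 0.3, radius = 1.0, etc.) are arbitrary choices made for this example.

```python
# Hypothetical sketch: first exit times of SGD on f(x) = x^2 / 2 under
# symmetric alpha-stable (heavy-tailed) gradient noise.
import numpy as np

rng = np.random.default_rng(0)

def sas_noise(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method (alpha != 1)."""
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit-rate exponential
    return (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))

def mean_exit_iterations(eta, alpha=1.7, sigma=0.3, radius=1.0,
                         n_runs=200, max_iters=200_000, rng=rng):
    """Average iteration at which |x| first exceeds `radius`, over n_runs trajectories."""
    x = np.zeros(n_runs)                           # all runs start at the minimum
    exit_iter = np.full(n_runs, max_iters)         # censored at max_iters if no exit occurs
    alive = np.ones(n_runs, dtype=bool)
    for k in range(max_iters):
        noise = sigma * sas_noise(alpha, alive.sum(), rng)
        x[alive] -= eta * (x[alive] + noise)       # SGD step; gradient of x^2/2 is x
        exited = alive & (np.abs(x) > radius)
        exit_iter[exited] = k + 1
        alive &= ~exited
        if not alive.any():
            break
    return exit_iter.mean()

# Smaller step-sizes typically need more iterations before a first exit; the paper's
# analysis quantifies when such discrete exit behavior stays close to that of the
# continuous-time Levy-driven SDE.
for eta in (0.1, 0.05):
    print(f"eta = {eta}: mean exit iteration = {mean_exit_iterations(eta):.0f}")
```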
Pages: 11
Related Papers
50 records in total
  • [32] Time series forecasting: problem of heavy-tailed distributed noise
    Markiewicz, Marta
    Wylomanska, Agnieszka
    INTERNATIONAL JOURNAL OF ADVANCES IN ENGINEERING SCIENCES AND APPLIED MATHEMATICS, 2021, 13 (2-3) : 248 - 256
  • [33] Stochastic Compositional Gradient Descent Under Compositional Constraints
    Thomdapu, Srujan Teja
    Vardhan, Harsh
    Rajawat, Ketan
    IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2023, 71 : 1115 - 1127
  • [34] Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent
    Liu, Kangqiao
    Liu Ziyin
    Ueda, Masahito
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [35] Differentially private stochastic gradient descent with low-noise
    Wang, Puyu
    Lei, Yunwen
    Ying, Yiming
    Zhou, Ding-Xuan
    NEUROCOMPUTING, 2024, 587
  • [36] Distributed Stochastic Strongly Convex Optimization under Heavy-Tailed Noises
    Sun, Chao
    Chen, Bo
    2024 IEEE INTERNATIONAL CONFERENCE ON CYBERNETICS AND INTELLIGENT SYSTEMS, CIS AND IEEE INTERNATIONAL CONFERENCE ON ROBOTICS, AUTOMATION AND MECHATRONICS, RAM, CIS-RAM 2024, 2024, : 150 - 155
  • [37] DEA model considering outputs with stochastic noise and a heavy-tailed (stable) distribution
    Naseri, Hassan
    Najafi, S. Esmaeil
    Saghaei, Abbas
    INFOR, 2020, 58 (01) : 87 - 108
  • [38] Distributed stochastic Nash equilibrium seeking under heavy-tailed noises
    Sun, Chao
    Chen, Bo
    Wang, Jianzheng
    Wang, Zheming
    Yu, Li
    AUTOMATICA, 2025, 173
  • [39] Analysis of polynomial FM signals corrupted by heavy-tailed noise
    Barkat, B
    Stankovic, L
    SIGNAL PROCESSING, 2004, 84 (01) : 69 - 75
  • [40] Convergence analysis of distributed stochastic gradient descent with shuffling
    Meng, Qi
    Chen, Wei
    Wang, Yue
    Ma, Zhi-Ming
    Liu, Tie-Yan
    NEUROCOMPUTING, 2019, 337 : 46 - 57