First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Cited: 0
Authors
Nguyen, Thanh Huy [1]
Simsekli, Umut [1,2]
Gurbuzbalaban, Mert [3]
Richard, Gael [1]
Affiliations
[1] Telecom Paris, Inst Polytech Paris, LTCI, Paris, France
[2] Univ Oxford, Dept Stat, Oxford, England
[3] Rutgers Business Sch, Dept Management Sci & Informat Syst, New Brunswick, NJ USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords
SDES DRIVEN; DIFFERENTIAL-EQUATIONS; LEVY; UNIQUENESS;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings exhibits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by alpha-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Levy motion, and the metastability results for this SDE can then be used to illuminate the behavior of SGD, especially its 'preference for wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might exhibit significantly different behavior than its continuous-time limit. Intuitively, the behaviors of the two systems are expected to be similar only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding of how small the step-size should be chosen to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide a formal theoretical analysis in which we derive explicit conditions on the step-size such that the metastability behavior of the discrete-time system is similar to that of its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes, and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.
Pages: 11
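
As an illustration of the discretized dynamics described in the abstract, the following minimal Python sketch (not the authors' code) simulates SGD-like iterates x_{k+1} = x_k - eta * f'(x_k) + sigma * eta^(1/alpha) * S_k, where S_k is symmetric alpha-stable noise, and records the first exit time from a neighborhood of a minimum. The double-well objective f(x) = (x^2 - 1)^2, the step-size eta, the noise scale sigma, the stability index alpha, and the exit radius are arbitrary illustrative assumptions, not values from the paper.

import numpy as np

def sas_noise(alpha, size, rng):
    # Chambers-Mallows-Stuck sampler for symmetric alpha-stable variates (beta = 0).
    u = rng.uniform(-np.pi / 2.0, np.pi / 2.0, size)   # uniform angle
    w = rng.exponential(1.0, size)                     # unit-rate exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def first_exit_time(alpha=1.7, eta=1e-3, sigma=0.5, radius=0.5,
                    max_iter=500_000, seed=0):
    # Iterate x_{k+1} = x_k - eta * f'(x_k) + sigma * eta**(1/alpha) * S_k and
    # return the first "physical" time k*eta at which |x_k - x*| exceeds `radius`.
    # All constants are illustrative assumptions, not the paper's values.
    rng = np.random.default_rng(seed)
    grad = lambda x: 4.0 * x * (x ** 2 - 1.0)          # f(x) = (x^2 - 1)^2, minima at +/-1
    x = x_star = 1.0
    noise = sas_noise(alpha, max_iter, rng)
    for k in range(max_iter):
        x = x - eta * grad(x) + sigma * eta ** (1.0 / alpha) * noise[k]
        if abs(x - x_star) > radius:
            return (k + 1) * eta
    return float("inf")                                # no exit within the horizon

if __name__ == "__main__":
    exit_times = [first_exit_time(seed=s) for s in range(20)]
    print("median first exit time:", np.median(exit_times))

Sweeping eta in such a simulation gives a rough empirical counterpart to the paper's question of how small the step-size must be for the discrete-time exit times to track those of the continuous-time Levy-driven SDE.
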
Related Papers
50 in total
  • [11] Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact
    Kornilov, Nikita
    Gasnikov, Alexander
    Dvurechensky, Pavel
    Dvinskikh, Darina
    COMPUTATIONAL MANAGEMENT SCIENCE, 2023, 20 (01)
  • [12] Stable Process Approach to Analysis of Systems Under Heavy-Tailed Noise: Modeling and Stochastic Linearization
    Kashima, Kenji
    Aoyama, Hiroki
    Ohta, Yoshito
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2019, 64 (04) : 1344 - 1357
  • [13] Correction to: Analysis of stochastic gradient descent in continuous time
    Latz, Jonas
    STATISTICS AND COMPUTING, 2024, 34 (05)
  • [14] Revisiting the Noise Model of Stochastic Gradient Descent
    Battash, Barak
    Wolf, Lior
    Lindenbaum, Ofir
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [15] Stochastic Gradient Descent in Continuous Time
    Sirignano, Justin
    Spiliopoulos, Konstantinos
    SIAM JOURNAL ON FINANCIAL MATHEMATICS, 2017, 8 (01): : 933 - 961
  • [16] Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
    Wang, Hongjian
    Gurbuzbalaban, Mert
    Zhu, Lingjiong
    Simsekli, Umut
    Erdogdu, Murat A.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [17] Stochastic Gradient Descent with Noise of Machine Learning Type Part II: Continuous Time Analysis
    Wojtowytsch, Stephan
    JOURNAL OF NONLINEAR SCIENCE, 2024, 34 (01)
  • [18] Stochastic Gradient Descent with Noise of Machine Learning Type Part I: Discrete Time Analysis
    Wojtowytsch, Stephan
    JOURNAL OF NONLINEAR SCIENCE, 2023, 33 (03)
  • [20] Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise
    Gorbunov, Eduard
    Danilova, Marina
    Dobre, David
    Dvurechensky, Pavel
    Gasnikov, Alexander
    Gidel, Gauthier
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022, 35