Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks

Cited by: 114
Authors
Oymak S. [1 ]
Soltanolkotabi M. [2 ]
Affiliations
[1] The Department of Electrical and Computer Engineering, University of California at Riverside, Riverside, CA 92521
[2] The Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089
Source
IEEE Journal on Selected Areas in Information Theory, Institute of Electrical and Electronics Engineers Inc., 2020, Vol. 1, Issue 1 | Corresponding author: Soltanolkotabi, Mahdi (msoltoon@gmail.com)
Funding
U.S. National Science Foundation;
Keywords
Neural network training; Nonconvex optimization; Overparameterization; Random matrix theory;
DOI
10.1109/JSAIT.2020.2991332
Abstract
Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates the labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. In practice, however, much more moderate levels of overparameterization seem to suffice, and in many cases overparameterized models perfectly interpolate the training data as soon as the number of parameters exceeds the size of the training data by a constant factor. There is thus a large gap between the existing theoretical literature and practical experiments. In this paper we take a step toward closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs). © 2020 IEEE.
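The abstract's central condition, that randomly initialized (stochastic) gradient descent on a shallow network fits the training data once the square root of the number of parameters exceeds the number of training samples, can be illustrated with a small numerical sketch. The snippet below is not the authors' code: the tanh activation (standing in for a generic smooth activation), the fixed random output layer, the width, the step size, and the iteration count are all illustrative assumptions, with the hidden width k chosen so that k*d >= n^2, i.e. sqrt(#parameters) >= n.

```python
# Minimal sketch (illustrative, not the paper's implementation) of training a
# one-hidden-layer network f(x) = v^T tanh(W x) by full-batch gradient descent
# on n samples, in the moderate overparameterization regime sqrt(k*d) >= n.
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 10                      # number of training samples and input dimension
k = int(np.ceil(n**2 / d))         # hidden width so that k*d >= n^2, i.e. sqrt(#params) >= n
X = rng.standard_normal((n, d)) / np.sqrt(d)     # roughly unit-norm inputs
y = rng.standard_normal(n)         # arbitrary labels (even random noise)

W = rng.standard_normal((k, d))                    # trainable hidden-layer weights
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # fixed output layer (a common simplification)

def predict(W):
    # shallow network with a smooth activation: predictions for all n samples
    return np.tanh(X @ W.T) @ v

step = 0.5                         # illustrative step size, not a tuned value
for t in range(3000):
    resid = predict(W) - y                          # residuals, shape (n,)
    act_grad = (1.0 - np.tanh(X @ W.T) ** 2) * v    # tanh'(W x_i) weighted by v, shape (n, k)
    grad = (resid[:, None] * act_grad).T @ X        # gradient of 0.5*||resid||^2 w.r.t. W, shape (k, d)
    W -= step * grad
    if t % 500 == 0:
        print(f"iteration {t:4d}   loss {0.5 * np.sum(resid**2):.3e}")

print("final loss:", 0.5 * np.sum((predict(W) - y) ** 2))
```

Under these assumptions the printed loss should shrink geometrically toward zero, mirroring the interpolation behavior described in the abstract; the specific rate and the required step size depend on the data and initialization.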
Pages: 84 - 105
Page count: 21
Related papers
50 records in total
  • [1] Subquadratic Overparameterization for Shallow Neural Networks
    Song, Chaehwan
    Ramezani-Kebrya, Ali
    Pethick, Thomas
    Eftekhari, Armin
    Cevher, Volkan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [2] On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions
    Martin, Simon
    Bach, Francis
    Biroli, Giulio
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [3] A Global Convergence PSO Training Algorithm of Neural Networks
    Li, Ming
    Li, Wei
    Yang, Cheng-wu
    2010 8TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA), 2010, : 3261 - 3265
  • [4] Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks
    Wang, Mingze
    Ma, Chao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [5] Generalization Guarantees of Gradient Descent for Shallow Neural Networks
    Wang, Puyu
    Lei, Yunwen
    Wang, Di
    Ying, Yiming
    Zhou, Ding-Xuan
    NEURAL COMPUTATION, 2025, 37 (02) : 344 - 402
  • [6] Global Convergence Analysis of Local SGD for Two-layer Neural Network without Overparameterization
    Bao, Yajie
    Shehu, Amarda
    Liu, Mingrui
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] Convergence and Recovery Guarantees of Unsupervised Neural Networks for Inverse Problems
    Buskulic, Nathan
    Fadili, Jalal
    Queau, Yvain
    JOURNAL OF MATHEMATICAL IMAGING AND VISION, 2024, 66 (04) : 584 - 605
  • [8] Global convergence of training methods for neural networks based on the state-estimation
    Tsumura, T
    Tatsumi, K
    Tanino, T
    SICE 2003 ANNUAL CONFERENCE, VOLS 1-3, 2003, : 1266 - 1271
  • [9] Post-training Quantization for Neural Networks with Provable Guarantees*
    Zhang, Jinjie
    Zhou, Yixuan
    Saab, Rayan
    SIAM JOURNAL ON MATHEMATICS OF DATA SCIENCE, 2023, 5 (02): : 373 - 399
  • [10] Convergence of Adversarial Training in Overparametrized Neural Networks
    Gao, Ruiqi
    Cai, Tianle
    Li, Haochuan
    Wang, Liwei
    Hsieh, Cho-Jui
    Lee, Jason D.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32