THE INTERPOLATION PHASE TRANSITION IN NEURAL NETWORKS: MEMORIZATION AND GENERALIZATION UNDER LAZY TRAINING

Cited: 19
Authors:
Montanari, Andrea [1]
Zhong, Yiqiao
Affiliation:
[1] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Source:
ANNALS OF STATISTICS, 2022, Vol. 50, No. 05
Keywords:
Neural tangent kernel; memorization; overfitting; overparametrization; kernel ridge regression; DESCENT; MODELS
DOI:
10.1214/22-AOS2211
Chinese Library Classification (CLC):
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline codes:
020208; 070103; 0714
Abstract:
Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here, we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in d dimensions, and N hidden neurons. We assume that both the sample size n and the dimension d are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime Nd >> n. This characterization implies, as a corollary, that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as Nd >> n and, therefore, the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, minimum ℓ2-norm interpolation. We prove that, as soon as Nd >> n, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular, on log n / log d).
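
To make the objects in the abstract concrete, the sketch below (Python, under stated assumptions) builds the empirical NT kernel of a two-layer ReLU network with gradients taken only with respect to the first-layer weights, checks numerically that its minimum eigenvalue stays bounded away from zero when Nd >> n, and fits NT kernel ridge regression on the training labels. The Gaussian covariates, the ReLU activation, the +/-1 second-layer weights, and the normalizations are assumptions made for illustration; this is a minimal sketch, not the authors' construction in full, and all names in it are hypothetical.

# Minimal numerical sketch (illustrative, not the authors' code): the empirical
# NT kernel of a two-layer ReLU network, its minimum eigenvalue in the
# overparametrized regime N*d >> n, and NT kernel ridge regression.
# Assumptions made for the example: Gaussian isotropic covariates, ReLU
# activation, second-layer weights fixed to +/-1, gradients taken with
# respect to the first-layer weights only.
import numpy as np

rng = np.random.default_rng(0)

n, d, N = 200, 30, 400        # sample size, dimension, hidden width: N*d = 12000 >> n
lam = 1e-3                    # ridge penalty; lam -> 0+ gives min-l2-norm interpolation

X = rng.standard_normal((n, d))                 # isotropic covariates, norm ~ sqrt(d)
y = rng.standard_normal(n)                      # arbitrary (even purely random) labels
W = rng.standard_normal((N, d)) / np.sqrt(d)    # first-layer weights, rows ~ unit norm
b = rng.choice([-1.0, 1.0], size=N)             # second-layer signs

def nt_kernel(XA, XB):
    """Empirical NT kernel w.r.t. first-layer weights:
    K(x, x') = (<x, x'> / d) * (1/N) * sum_j b_j^2 * relu'(<w_j, x>) * relu'(<w_j, x'>)."""
    SA = (XA @ W.T > 0).astype(float)           # relu'(<w_j, x>) for each (sample, neuron)
    SB = (XB @ W.T > 0).astype(float)
    return (XA @ XB.T / d) * ((SA * b**2) @ SB.T / N)

K = nt_kernel(X, X)                             # n x n empirical NT kernel
print("lambda_min(K) =", np.linalg.eigvalsh(K).min())   # stays bounded away from 0

# NT kernel ridge regression: f(x) = K(x, X) @ (K + lam*I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(n), y)
residual = np.linalg.norm(K @ alpha - y) / np.linalg.norm(y)
print("relative training residual:", residual)  # -> 0 as lam -> 0 (exact interpolation)

Sending lam to zero in this sketch recovers the minimum ℓ2-norm interpolator, which is well defined here precisely because the minimum eigenvalue of K is bounded away from zero in the Nd >> n regime.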
Pages: 2816-2847
Number of pages: 32
Related Papers
50 records in total
  • [1] Training Neural Networks for and by Interpolation
    Berrada, Leonard
    Zisserman, Andrew
    Kumar, M. Pawan
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020
  • [2] Lazy training of radial basis neural networks
    Valls, Jose M.
    Galvan, Ines M.
    Isasi, Pedro
    ARTIFICIAL NEURAL NETWORKS - ICANN 2006, PT 1, 2006, 4131 : 198 - 207
  • [3] Investigating Generalization in Neural Networks under Optimally Evolved Training Perturbations
    Chaudhury, Subhajit
    Yamasaki, Toshihiko
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3617 - 3621
  • [4] Disentangling feature and lazy training in deep neural networks
    Geiger, Mario
    Spigler, Stefano
    Jacot, Arthur
    Wyart, Matthieu
    JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2020, 2020 (11)
  • [5] Memorization Capacity of Deep Neural Networks under Parameter Quantization
    Boo, Yoonho
    Shin, Sungho
    Sung, Wonyong
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 1383 - 1387
  • [6] Training and Generalization Errors for Underparameterized Neural Networks
    Martin Xavier, Daniel
    Chamoin, Ludovic
    Fribourg, Laurent
    IEEE CONTROL SYSTEMS LETTERS, 2023, 7 : 3926 - 3931
  • [7] Limitations of Lazy Training of Two-layers Neural Networks
    Ghorbani, Behrooz
    Mei, Song
    Misiakiewicz, Theodor
    Montanari, Andrea
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019
  • [8] Understanding activation patterns in artificial neural networks by exploring stochastic processes: Discriminating generalization from memorization
    Lehmler, Stephan Johann
    Saif-ur-Rehman, Muhammad
    Glasmachers, Tobias
    Iossifidis, Ioannis
    NEUROCOMPUTING, 2024, 610
  • [9] Training neural networks to encode symbols enables combinatorial generalization
    Vankov, Ivan I.
    Bowers, Jeffrey S.
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 2020, 375 (1791)
  • [10] Training feedforward neural networks: An algorithm giving improved generalization
    Lee, CW
    NEURAL NETWORKS, 1997, 10 (01) : 61 - 68