Self-Distillation Amplifies Regularization in Hilbert Space

Cited by: 0
Authors
Mobahi, Hossein [1]
Farajtabar, Mehrdad [2]
Bartlett, Peter L. [1,3]
Affiliations
[1] Google Research, Mountain View, CA 94043, USA
[2] DeepMind, Mountain View, CA, USA
[3] University of California, Berkeley, Dept. of EECS, Berkeley, CA, USA
Keywords: (none listed)
DOI: not available
CLC number: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. To the best of our knowledge, there is no rigorous understanding of why this happens. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to ℓ2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
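The abstract describes two ingredients: an ℓ2-regularized fit of a nonlinear function in a Hilbert space, and a self-distillation loop that feeds the trained model's predictions back in as the next round's targets. The sketch below is a minimal illustration of that loop, not the authors' code: it iterates kernel ridge regression (an ℓ2-regularized fit in a reproducing-kernel Hilbert space) and relabels the training set with each round's predictions. The RBF kernel, its width, the regularization strength lam, the number of rounds, and the toy sine data are all illustrative assumptions, not values taken from the paper.

```
# Minimal sketch of self-distillation with an l2-regularized fit in an RKHS
# (kernel ridge regression). Assumptions: RBF kernel, lam=0.1, 5 rounds, toy data.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(K, targets, lam):
    """Solve (K + lam * n * I) alpha = targets for the regularized coefficients."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), targets)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(40)   # noisy ground-truth targets

K = rbf_kernel(X, X)
targets = y.copy()
for step in range(5):                                  # self-distillation rounds
    alpha = kernel_ridge_fit(K, targets, lam=0.1)      # l2-regularized fit
    preds = K @ alpha                                  # model's own predictions
    print(f"round {step}: train MSE vs. original labels = {np.mean((preds - y)**2):.4f}")
    targets = preds                                    # predictions become new targets
```

In this setup each round re-solves the same regularized problem with relabeled targets; the paper's analysis shows that iterating this loop acts like progressively stronger regularization, which is why a few rounds can help while many rounds under-fit.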
Pages: 11