Co-advise: Cross Inductive Bias Distillation

Cited by: 25
Authors
Ren, Sucheng [1, 5]
Gao, Zhengqi [2]
Hua, Tianyu [3, 5]
Xue, Zihui [4]
Tian, Yonglong [2]
He, Shengfeng [1]
Zhao, Hang [3, 5]
Affiliations
[1] South China University of Technology, Guangzhou, People's Republic of China
[2] MIT, Cambridge, MA 02139 USA
[3] Tsinghua University, Beijing, People's Republic of China
[4] University of Texas at Austin, Austin, TX 78712 USA
[5] Shanghai Qi Zhi Institute, Shanghai, People's Republic of China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52688.2022.01627
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The inductive bias of vision transformers is relatively weak, so they do not work well with insufficient data. Knowledge distillation is thus introduced to assist the training of transformers. Unlike previous works, which provide only heavy convolution-based teachers, in this paper we delve into the influence of models' inductive biases (e.g., convolution and involution) in knowledge distillation. Our key observation is that the teacher's accuracy is not the dominant factor in the student's accuracy; the teacher's inductive bias matters more. We demonstrate that lightweight teachers with different architectural inductive biases can be used to co-advise the student transformer with outstanding performance. The rationale is that models designed with different inductive biases tend to focus on diverse patterns, so teachers with different inductive biases acquire different knowledge despite being trained on the same dataset. This diverse knowledge provides a more precise and comprehensive description of the data, and it compounds to boost the performance of the student during distillation. Furthermore, we propose a token inductive bias alignment to align the inductive bias of each token with its target teacher model. With only lightweight teachers provided and using this cross inductive bias distillation method, our vision transformers (termed CiT) outperform all previous vision transformers (ViT) of the same architecture on ImageNet. Moreover, our small-size model CiT-SAK further achieves 82.7% Top-1 accuracy on ImageNet without modifying the attention module of the ViT. Code is available at https://github.com/OliverRensu/co-advise.
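The idea of several teachers co-advising one student can be illustrated with a generic multi-teacher distillation loss. The sketch below is not taken from the paper or its repository; the abstract does not specify the loss. It assumes a student that emits one set of class-token logits plus one set of logits per distillation token, with each distillation token matched to its own teacher through a KL-divergence term on temperature-softened outputs. All names (`co_advise_loss`, the token keys `"conv"` and `"inv"`, `temperature`, `weights`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a hard integer label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def co_advise_loss(class_logits, token_logits, teacher_logits, label,
                   temperature=1.0, weights=None):
    """Hard-label loss on the class token plus one soft-label term per teacher.

    token_logits and teacher_logits are dicts keyed by a teacher name
    (hypothetical keys such as "conv" and "inv"). Each student token is
    compared only with the teacher that shares its key. This is an assumed,
    generic formulation, not the authors' exact objective.
    """
    weights = weights or {name: 1.0 for name in teacher_logits}
    loss = cross_entropy(class_logits, label)
    for name, t_logits in teacher_logits.items():
        teacher_probs = softmax(t_logits, temperature)
        student_probs = softmax(token_logits[name], temperature)
        # temperature**2 keeps gradient magnitudes comparable across temperatures
        loss += weights[name] * temperature ** 2 * kl_divergence(
            teacher_probs, student_probs)
    return loss


# Toy example with three classes and two teachers.
class_logits = [2.0, 0.5, -1.0]
teachers = {"conv": [1.5, 0.2, -0.3], "inv": [0.1, 1.2, -0.5]}
student_tokens = {"conv": [1.0, 0.0, 0.0], "inv": [0.0, 1.0, 0.0]}
print(co_advise_loss(class_logits, student_tokens, teachers, label=0))
```

Keeping one token per teacher, rather than averaging the teachers into a single target, lets each teacher's distinct output distribution reach the student separately, which is the property the abstract attributes to teachers with different inductive biases.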
Pages: 16752-16761
Page count: 10