R-Drop: Regularized Dropout for Neural Networks

Cited by: 0
Authors
Liang, Xiaobo [1 ]
Wu, Lijun [2 ]
Li, Juntao [1 ]
Wang, Yue [1 ]
Meng, Qi [2 ]
Qin, Tao [2 ]
Chen, Wei [2 ]
Zhang, Min [1 ]
Liu, Tie-Yan [2 ]
Affiliations
[1] Soochow Univ, Suzhou, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021 / Vol. 34
Funding
National Science Foundation (USA);
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dropout is a powerful and widely used technique to regularize the training of deep neural networks. Though effective, the randomness introduced by dropout causes a non-negligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub-models sampled by dropout. Theoretical analysis reveals that R-Drop reduces this inconsistency. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performance with the vanilla Transformer model on WMT14 English -> German translation (30.91 BLEU) and WMT14 English -> French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available on GitHub.
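The abstract's description of the R-Drop objective, two forward passes whose dropout masks sample different sub-models, regularized by a bidirectional KL term on their output distributions, can be made concrete with a short sketch. The snippet below is a minimal illustration in PyTorch; the model, the weight `alpha`, and the classification setup are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def r_drop_loss(model, inputs, labels, alpha=1.0):
    # Two forward passes: with dropout active, each pass samples a different sub-model.
    logits1 = model(inputs)
    logits2 = model(inputs)

    # Ordinary task loss (cross-entropy here), averaged over both passes.
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))

    # Bidirectional (symmetric) KL divergence between the two output distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
        + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")
    )
    return ce + alpha * kl


if __name__ == "__main__":
    # Tiny illustrative usage: a dropout MLP classifier on random data (hypothetical setup).
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Dropout(0.3),
        torch.nn.Linear(32, 4),
    )
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = r_drop_loss(model, x, y, alpha=1.0)
    loss.backward()
    print(float(loss))
```

Because both passes share parameters, the KL term only adds a second forward pass per batch; the weight `alpha` trades off the consistency regularizer against the task loss.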
Pages: 16
Related papers (50 in total)
  • [21] Scala, Francesco; Ceschini, Andrea; Panella, Massimo; Gerace, Dario. A General Approach to Dropout in Quantum Neural Networks. ADVANCED QUANTUM TECHNOLOGIES, 2023.
  • [22] Gao, Wei; Zhou, Zhi-Hua. Dropout Rademacher complexity of deep neural networks. SCIENCE CHINA-INFORMATION SCIENCES, 2016, 59(07).
  • [23] Perez-Rodriguez, P.; Gianola, D.; Weigel, K. A.; Rosa, G. J. M.; Crossa, J. Technical Note: An R package for fitting Bayesian regularized neural networks with applications in animal breeding. JOURNAL OF ANIMAL SCIENCE, 2013, 91(08): 3522-3531.
  • [24] Wang, Cheng; Niepert, Mathias. State-Regularized Recurrent Neural Networks. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97.
  • [25] Shi, Yong; Zheng, Lei; Quan, Pei; Niu, Lingfeng. Wasserstein distance regularized graph neural networks. INFORMATION SCIENCES, 2024, 670.
  • [26] de Campos Souza, Paulo Vitor; Lacerda Silva, Gustavo Rodrigues; Bambirra Torres, Luiz Carlos. Uninorm Based Regularized Fuzzy Neural Networks. PROCEEDINGS OF THE 2018 IEEE INTERNATIONAL CONFERENCE ON EVOLVING AND ADAPTIVE INTELLIGENT SYSTEMS (EAIS), 2018.
  • [27] Dutta, Shamak; Tripp, Bryan; Taylor, Graham W. Convolutional Neural Networks Regularized by Correlated Noise. 2018 15TH CONFERENCE ON COMPUTER AND ROBOT VISION (CRV), 2018: 375-382.
  • [28] Saggaf, MM; Nebrija, EL. Estimation of missing logs by regularized neural networks. AAPG BULLETIN, 2003, 87(08): 1377-1389.
  • [29] Lee, Sanghun; Lee, Chulhee. Revisiting spatial dropout for regularizing convolutional neural networks. Multimedia Tools and Applications, 2020, 79: 34195-34207.
  • [30] Lee, Sanghun; Lee, Chulhee. Revisiting spatial dropout for regularizing convolutional neural networks. MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79(45-46): 34195-34207.