R-Drop: Regularized Dropout for Neural Networks

Cited by: 0
Authors
Liang, Xiaobo [1 ]
Wu, Lijun [2 ]
Li, Juntao [1 ]
Wang, Yue [1 ]
Meng, Qi [2 ]
Qin, Tao [2 ]
Chen, Wei [2 ]
Zhang, Min [1 ]
Liu, Tie-Yan [2 ]
Affiliations
[1] Soochow Univ, Suzhou, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021, Vol. 34
Funding
US National Science Foundation;
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dropout is a powerful and widely used technique to regularize the training of deep neural networks. Though effective and performing well, the randomness introduced by dropout causes non-negligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub-models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the above inconsistency. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performance with the vanilla Transformer model on WMT14 English -> German translation (30.91 BLEU) and WMT14 English -> French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available at GitHub.
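The abstract's core idea, two stochastic forward passes through the same network plus a symmetric KL penalty between their output distributions, can be sketched as a single loss function. The following is a minimal illustrative sketch in PyTorch, not the authors' released implementation; the function name r_drop_loss, the weight alpha, and the batchmean reduction are assumptions for illustration, and the model must be in training mode so that dropout is active.

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, x, labels, alpha=1.0):
    """Task loss plus symmetric KL consistency between two dropout sub-models (sketch)."""
    # Two forward passes on the same input; because dropout is stochastic,
    # each pass effectively goes through a different sub-model.
    logits1 = model(x)
    logits2 = model(x)

    # Ordinary cross-entropy, averaged over the two passes.
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))

    # Bidirectional KL-divergence between the two output distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
        + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")
    )

    # alpha is an assumed tunable coefficient on the consistency term.
    return ce + alpha * kl
```

In use, the model would be set to model.train() so dropout samples two different sub-models, and alpha would be tuned per task; the exact weighting and reduction used by the authors may differ from this sketch.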
Pages: 16
Related Papers
50 records in total
  • [11] Manifold Regularized Deep Neural Networks
    Tomar, Vikrant Singh
    Rose, Richard C.
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 348 - 352
  • [12] Checkerboard Dropout: A Structured Dropout With Checkerboard Pattern for Convolutional Neural Networks
    Nguyen, Khanh-Binh
    Choi, Jaehyuk
    Yang, Joon-Sung
    IEEE ACCESS, 2022, 10 : 76044 - 76054
  • [13] LGST-Drop: label-guided structural dropout for spatial-temporal convolutional neural networks
    Cui, Hu
    Huang, Renjing
    Zhang, Ruoyu
    Huang, Chuhua
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (03)
  • [14] Towards dropout training for convolutional neural networks
    Wu, Haibing
    Gu, Xiaodong
    NEURAL NETWORKS, 2015, 71 : 1 - 10
  • [15] Variational Dropout Sparsifies Deep Neural Networks
    Molchanov, Dmitry
    Ashukha, Arsenii
    Vetrov, Dmitry
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [16] Augmenting Recurrent Neural Networks Resilience by Dropout
    Bacciu, Davide
    Crecchi, Francesco
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (01) : 345 - 351
  • [17] Dropout Rademacher complexity of deep neural networks
    Gao, Wei
    Zhou, Zhi-Hua
    SCIENCE CHINA INFORMATION SCIENCES, 2016, 59 (07) : 173 - 184
  • [18] Regularization of deep neural networks with spectral dropout
    Khan, Salman H.
    Hayat, Munawar
    Porikli, Fatih
    NEURAL NETWORKS, 2019, 110 : 82 - 90
  • [20] Analysis on the Dropout Effect in Convolutional Neural Networks
    Park, Sungheon
    Kwak, Nojun
    COMPUTER VISION - ACCV 2016, PT II, 2017, 10112 : 189 - 204