Widening the bottleneck of lexical choice for non-autoregressive translation

Cited by: 0
Authors
Ding, Liang [1 ]
Wang, Longyue [2 ]
Liu, Siyou [3 ]
Luo, Weihua [2 ]
Zhang, Kaifu [2 ]
Affiliations
[1] Univ Sydney, Sydney, Australia
[2] Alibaba Int Digital Commerce, Hangzhou, Peoples R China
[3] Univ Macau, Macau, Peoples R China
Source: Computer Speech and Language
Keywords
Lexical choice; Non-autoregressive translation; Low-frequency word; Knowledge distillation; New benchmark
DOI
10.1016/j.csl.2024.101765
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, non-autoregressive models have enjoyed great popularity in the natural language processing (NLP) community and have gradually spread into neighboring areas such as speech recognition and computer vision. Non-autoregressive translation (NAT) has been proposed to improve the decoding efficiency of translation models by predicting all tokens independently and simultaneously. To reduce the complexity of the raw data, knowledge distillation (KD) with an autoregressive translation (AT) teacher is the standard preliminary step for training NAT models. In this study, we first reveal that the discrepancy between the raw and the KD data leads to lexical choice errors when predicting low-frequency words. We then bridge this gap with three architecture-free approaches that introduce no extra computational cost: (1) Model Level, where we add an extra Kullback-Leibler divergence term that compares the lexical choice of the NAT model with that embedded in the raw data; (2) Parallel Data Level, where we reactivate low-frequency information through raw pre-training and reverse KD training; (3) Monolingual Data Level, where we transfer to the NAT model both the knowledge of the bilingual raw data and that of new monolingual data. We conduct experiments on widely used NAT benchmarks (i.e., WMT14 English-German and WMT16 Romanian-English) over two advanced NAT architectures. Results demonstrate that the proposed approaches significantly and universally improve translation quality by reducing translation errors on low-frequency words. Extensive analyses show that (1) the approaches generate translations that contain more low-frequency words; (2) the techniques can be combined profitably to further recall useful information lost in standard KD; and (3) enlarging the monolingual data consistently improves BLEU scores, although the trend does not hold when the monolingual data is scaled further. Finally, we establish new NAT benchmarks by validating our approaches on three additional datasets that vary in language and scale (i.e., WMT17 Chinese-English, WMT19 English-German and WAT17 Japanese-English). We will release the data, code and models, which we hope will significantly promote research in this field.
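The Model Level approach described in the abstract lends itself to a short illustration. Below is a minimal sketch (not the authors' released code) of adding a Kullback-Leibler term that pulls the NAT output distribution towards a word-level lexical prior estimated from the raw, pre-distillation parallel data. The tensor shapes, the `raw_lexical_prior` input (e.g. built from word alignments on the raw bitext) and the 0.1 weight are illustrative assumptions.

```python
# Sketch of a "Model Level" lexical-choice regularizer for NAT training:
# standard cross-entropy on the KD targets plus a KL term towards a
# per-position lexical distribution estimated from the raw parallel data.
import torch
import torch.nn.functional as F


def lexical_kl_loss(nat_logits, raw_lexical_prior, target_mask):
    """KL(prior || NAT model), averaged over non-padding target positions.

    nat_logits:        (batch, tgt_len, vocab) raw decoder scores of the NAT model
    raw_lexical_prior: (batch, tgt_len, vocab) lexical distribution estimated from
                       the raw (pre-distillation) bitext; may be sparse
    target_mask:       (batch, tgt_len) 1.0 for real tokens, 0.0 for padding
    """
    log_p_nat = F.log_softmax(nat_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # positions where the prior is zero contribute nothing to the loss.
    kl = F.kl_div(log_p_nat, raw_lexical_prior, reduction="none").sum(dim=-1)
    return (kl * target_mask).sum() / target_mask.sum().clamp(min=1)


def nat_training_loss(ce_loss, nat_logits, raw_lexical_prior, target_mask, kl_weight=0.1):
    # Total objective: cross-entropy on KD data + weighted raw-data lexical prior term.
    return ce_loss + kl_weight * lexical_kl_loss(nat_logits, raw_lexical_prior, target_mask)
```

The intended effect, per the abstract, is to keep probability mass on low-frequency target words that the AT teacher tends to drop during distillation; the exact form of the prior and the weighting schedule are design choices left to the implementation.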
Pages: 16
Related Papers (50 in total)
  • [31] Aligned Cross Entropy for Non-Autoregressive Machine Translation
    Ghazvininejad, Marjan
    Karpukhin, Vladimir
    Zettlemoyer, Luke
    Levy, Omer
    37TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML 2020), 2020
  • [32] Uncertainty-aware non-autoregressive neural machine translation
    Liu, Chuanming
    Yu, Jingqi
    COMPUTER SPEECH AND LANGUAGE, 2023, 78
  • [33] Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation
    Huang, Chenyang
    Huang, Fei
    Zheng, Zaixiang
    Zaiane, Osmar
    Zhou, Hao
    Mou, Lili
    13TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING AND THE 3RD CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, IJCNLP-AACL 2023, 2023, : 161 - 170
  • [34] Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation
    Liu, Min
    Bao, Yu
    Zhao, Chengqi
    Huang, Shujian
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 13246 - 13254
  • [35] Improving Non-autoregressive Neural Machine Translation with Monolingual Data
    Zhou, Jiawei
    Keung, Phillip
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 1893 - 1898
  • [36] Non-autoregressive neural machine translation with auxiliary representation fusion
    Du, Quan
    Feng, Kai
    Xu, Chen
    Xiao, Tong
    Zhu, Jingbo
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 41 (06) : 7229 - 7239
  • [37] NON-AUTOREGRESSIVE MACHINE TRANSLATION WITH A NOVEL MASKED LANGUAGE MODEL
    Li, Ke
    Li, Jie
    Wang, Jun
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [38] Hint-Based Training for Non-Autoregressive Machine Translation
    Li, Zhuohan
    Lin, Zi
    He, Di
    Tian, Fei
    Qin, Tao
    Wang, Liwei
    Liu, Tie-Yan
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 5708 - 5713
  • [39] Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade
    Gu, Jiatao
    Kong, Xiang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 120 - 133
  • [40] A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond
    Xiao Y.
    Wu L.
    Guo J.
    Li J.
    Zhang M.
    Qin T.
    Liu T.-Y.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45 (10) : 11407 - 11427