Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Cited by: 0
Authors
Zheng, Bo [1 ,2 ]
Dong, Li [2 ]
Huang, Shaohan [2 ]
Singhal, Saksham [2 ]
Che, Wanxiang [1 ]
Liu, Ting [1 ]
Song, Xia [2 ]
Wei, Furu [2 ]
Affiliations
[1] Harbin Institute of Technology, Harbin, China
[2] Microsoft Corporation, Redmond, WA 98052, USA
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity. To this end, we propose VoCap, an algorithm that determines the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pre-trained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
Pages: 3203-3215
Page count: 13
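The abstract names two techniques without detail, so two short sketches follow. First, the VoCap idea as summarized above: greedily hand out a shared vocabulary budget, one increment at a time, to whichever language currently gains the most from growing its sub-vocabulary. The sketch assumes ALP (average log probability) scores have been precomputed for each (language, vocabulary size) pair, e.g. from per-language sub-word models; the function name, the fixed step size, and the weighting scheme are illustrative assumptions, not the authors' exact procedure.

```python
import heapq

def allocate_vocab_capacity(alp, languages, step, budget, weight):
    """Greedily split a shared vocabulary budget across languages.

    alp[(lang, size)] -- precomputed average log probability (ALP) of
                         lang's corpus under a monolingual vocabulary of
                         `size` tokens (higher is better, gains diminish)
    weight[lang]      -- per-language rescaling weight; a smoothed
                         corpus-size weight is one plausible choice
    """
    size = {lang: step for lang in languages}  # every language starts with one step
    used = step * len(languages)

    def gain(lang):
        # Weighted ALP improvement from granting `lang` one more step,
        # or None when no larger precomputed vocabulary exists.
        nxt = size[lang] + step
        if (lang, nxt) not in alp:
            return None
        return weight[lang] * (alp[(lang, nxt)] - alp[(lang, size[lang])])

    # Max-heap of candidate increments (gains negated: heapq is a min-heap).
    heap = [(-g, lang) for lang in languages
            if (g := gain(lang)) is not None]
    heapq.heapify(heap)

    while heap and used + step <= budget:
        _, lang = heapq.heappop(heap)   # language with the largest gain
        size[lang] += step
        used += step
        g = gain(lang)                  # only this language's gain changed
        if g is not None:
            heapq.heappush(heap, (-g, lang))
    return size

# Toy run with made-up ALP values: both languages end up with 2000 tokens.
alp = {("en", 1000): -9.0, ("en", 2000): -8.0, ("en", 3000): -7.6,
       ("sw", 1000): -9.5, ("sw", 2000): -8.2, ("sw", 3000): -7.5}
print(allocate_vocab_capacity(alp, ["en", "sw"], step=1000,
                              budget=4000, weight={"en": 1.0, "sw": 1.0}))
```

Second, a rough sketch of the k-NN-based target sampling idea: rather than normalizing the softmax over the full, now much larger vocabulary, score only the batch's gold tokens plus each gold token's k nearest neighbours in output-embedding space, which serve as hard negatives. The `knn_index` structure and the candidate construction are assumptions for illustration; the paper's exact sampling and normalization may differ.

```python
import numpy as np

def knn_sampled_logits(hidden, emb, targets, knn_index, k):
    """Score a small candidate set instead of the full vocabulary.

    hidden    -- (batch, dim) final hidden states
    emb       -- (|V|, dim) output embedding matrix
    targets   -- (batch,) gold token ids
    knn_index -- knn_index[t] lists precomputed nearest-neighbour token
                 ids of t in embedding space (refreshed periodically as
                 the embeddings move during training)
    """
    # Candidate set: the batch's gold tokens plus their k neighbours,
    # deduplicated and sorted by np.unique.
    cand = np.unique(np.concatenate(
        [targets] + [np.asarray(knn_index[t][:k]) for t in targets]))
    logits = hidden @ emb[cand].T         # (batch, |cand|), not (batch, |V|)
    pos = np.searchsorted(cand, targets)  # gold positions inside `cand`
    return logits, pos                    # feed to cross-entropy over cand
```

The saving comes from shrinking the logits matrix from (batch, |V|) to (batch, |candidates|), which is where a very large output vocabulary dominates pre-training cost.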