Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Cited by: 0
Authors
Zheng, Bo [1,2]
Dong, Li [2 ]
Huang, Shaohan [2 ]
Singhal, Saksham [2 ]
Che, Wanxiang [1 ]
Liu, Ting [1 ]
Song, Xia [2 ]
Wei, Furu [2 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose VoCap, an algorithm that determines the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pre-trained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
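
The abstract describes two components: VoCap, which allocates vocabulary capacity per language, and k-NN-based target sampling, which keeps the softmax affordable once the vocabulary grows. As a rough illustration of the second idea only (not the authors' published code), the PyTorch-style sketch below computes the masked-token loss over a reduced candidate set consisting of the batch's gold target tokens plus the k nearest neighbours of each target in the output-embedding space; the function name, the dot-product similarity, and the candidate construction are assumptions made for this sketch.

# Illustrative sketch (assumed PyTorch API; not the paper's implementation).
import torch
import torch.nn.functional as F

def knn_sampled_softmax_loss(hidden, targets, embedding, k=100):
    """hidden: (N, d) representations at masked positions;
    targets: (N,) gold token ids; embedding: (V, d) softmax embedding matrix."""
    with torch.no_grad():
        # k nearest vocabulary tokens of each gold target, by dot-product
        # similarity in the output-embedding space.
        sims = embedding[targets] @ embedding.t()                # (N, V)
        knn = sims.topk(k, dim=-1).indices.reshape(-1)           # (N * k,)
        # Candidate set = gold targets plus their neighbours, deduplicated.
        candidates = torch.unique(torch.cat([targets, knn]))     # (C,), C << V
        # Map full-vocabulary ids to positions inside the candidate set.
        remap = torch.full((embedding.size(0),), -1,
                           dtype=torch.long, device=targets.device)
        remap[candidates] = torch.arange(candidates.numel(), device=targets.device)
    # Cross-entropy over the reduced candidate set instead of the full vocabulary.
    logits = hidden @ embedding[candidates].t()                  # (N, C)
    return F.cross_entropy(logits, remap[targets])

The point of the sketch is why such sampling helps: with a very large multilingual vocabulary V, the per-step softmax cost is dominated by V, whereas restricting it to a small candidate set C keeps pre-training speed close to that of a small-vocabulary model.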
Pages: 3203-3215
Page count: 13
Related Papers
50 items in total
• [21] Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training. Song, Yuqing; Chen, Shizhe; Jin, Qin; Luo, Wei; Xie, Jun; Huang, Fei. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 2843-2852.
• [22] An analysis on language transfer of pre-trained language model with cross-lingual post-training. Son, Suhyune; Park, Chanjun; Lee, Jungseob; Shim, Midan; Lee, Chanhee; Jang, Yoonna; Seo, Jaehyung; Lim, Jungwoo; Lim, Heuiseok. EXPERT SYSTEMS WITH APPLICATIONS, 2025, 267.
• [23] Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training. Li, Zejun; Fan, Zhihao; Chen, JingJing; Zhang, Qi; Huang, Xuanjing; Wei, Zhongyu. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023: 5939-5958.
• [24] PTEKC: pre-training with event knowledge of ConceptNet for cross-lingual event causality identification. Zhu, Enchang; Yu, Zhengtao; Huang, Yuxin; Gao, Shengxiang; Xian, Yantuan. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025, 16 (03): 1859-1872.
• [25] Cross-lingual Language Model Pretraining. Conneau, Alexis; Lample, Guillaume. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32.
• [26] Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-training. Hardalov, Momchil; Arora, Arnav; Nakov, Preslav; Augenstein, Isabelle. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 10729-10737.
• [27] Contrastive pre-training and instruction tuning for cross-lingual aspect-based sentiment analysis. Zhao, Wenwen; Yang, Zhisheng; Yu, Song; Zhu, Shiyu; Li, Li. APPLIED INTELLIGENCE, 2025, 55 (05).
• [28] Cross-Lingual Pre-Training Based Transfer for Zero-Shot Neural Machine Translation. Ji, Baijun; Zhang, Zhirui; Duan, Xiangyu; Zhang, Min; Chen, Boxing; Luo, Weihua. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34: 115-122.
• [29] Investigating cross-lingual training for offensive language detection. Pelicon, Andraz; Shekhar, Ravi; Skrlj, Blaz; Purver, Matthew; Pollak, Senja. PEERJ COMPUTER SCIENCE, 2021, 7: 2-39.
• [30] Language Anisotropic Cross-Lingual Model Editing. Xu, Yang; Hou, Yutai; Che, Wanxiang; Zhang, Min. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023: 5554-5569.