Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Times Cited: 0
Authors
Zheng, Bo [1 ,2 ]
Dong, Li [2 ]
Huang, Shaohan [2 ]
Singhal, Saksham [2 ]
Che, Wanxiang [1 ]
Liu, Ting [1 ]
Song, Xia [2 ]
Wei, Furu [2 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pre-trained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
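To make the two ideas in the abstract concrete, the following is a minimal, illustrative Python/PyTorch sketch: a smoothed per-language split of a shared vocabulary budget, and a softmax loss restricted to a k-nearest-neighbor candidate set per target token. The function names, the smoothing exponent, and the candidate construction are assumptions made for illustration only; they do not reproduce the paper's actual VoCap algorithm or its k-NN-based target sampling implementation.

```python
# Illustrative sketch only; not the paper's implementation.
import torch
import torch.nn.functional as F


def allocate_vocab_capacity(corpus_sizes, total_budget, alpha=0.7):
    """Split a total subword-vocabulary budget across languages.

    Hypothetical heuristic: capacity grows with corpus size raised to a
    smoothing exponent `alpha`, so low-resource languages keep a larger
    share of slots than raw proportional allocation would give them.
    Rounded sizes may not sum exactly to the budget.
    """
    weights = {lang: size ** alpha for lang, size in corpus_sizes.items()}
    norm = sum(weights.values())
    return {lang: max(1, int(total_budget * w / norm)) for lang, w in weights.items()}


def knn_sampled_softmax_loss(hidden, targets, output_embeddings, k=256):
    """Cross-entropy restricted to a per-target k-NN candidate set.

    Instead of normalizing over the full vocabulary, each gold token only
    competes against the k-1 vocabulary entries whose output embeddings are
    closest to its own, so the softmax cost scales with k rather than |V|.
    In practice the neighbor lists would be precomputed or served by an
    approximate-NN index; the full similarity matrix below is only for clarity.
    """
    target_emb = output_embeddings[targets].detach()               # (B, d), no grad through retrieval
    sims = target_emb @ output_embeddings.detach().t()             # (B, V)
    sims.scatter_(1, targets.unsqueeze(1), float("-inf"))          # exclude gold column from neighbors
    neighbors = sims.topk(k - 1, dim=-1).indices                   # (B, k-1)
    candidates = torch.cat([targets.unsqueeze(1), neighbors], 1)   # gold token at slot 0
    cand_emb = output_embeddings[candidates]                       # (B, k, d)
    logits = torch.einsum("bd,bkd->bk", hidden, cand_emb)          # (B, k)
    labels = torch.zeros_like(targets)                             # gold index is always 0
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy demonstration with made-up corpus sizes and random tensors.
    print(allocate_vocab_capacity({"en": 1e9, "sw": 1e6, "ka": 1e5}, total_budget=500_000))
    V, d, B = 10_000, 32, 8
    emb = torch.randn(V, d)
    loss = knn_sampled_softmax_loss(torch.randn(B, d), torch.randint(0, V, (B,)), emb, k=64)
    print(loss.item())
```

Keeping the gold token at a fixed slot lets the restricted loss reuse a standard cross-entropy call; the saving comes from normalizing over k candidates instead of the full vocabulary, which is what makes a much larger multilingual vocabulary affordable during pre-training.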
Pages: 3203-3215
Number of pages: 13
Related Papers
50 records in total
  • [31] Cross-lingual Language Model Pretraining for Retrieval
    Yu, Puxuan
    Fei, Hongliang
    Li, Ping
    PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2021 (WWW 2021), 2021: 1029-1039
  • [32] Cross-Lingual Knowledge Editing in Large Language Models
    Wang, Jiaan
    Liang, Yunlong
    Sun, Zengkui
    Cao, Yuxuan
    Xu, Jiarong
    Meng, Fandong
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 11676-11686
  • [33] EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning
    Guo, Ping
    Wei, Xiangpeng
    Hu, Yue
    Yang, Baosong
    Liu, Dayiheng
    Huang, Fei
    Xie, Jun
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [34] Language Model Priming for Cross-Lingual Event Extraction
    Fincke, Steven
    Agarwal, Shantanu
    Miller, Scott
    Boschee, Elizabeth
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 10627-10635
  • [35] Steering Large Language Models for Cross-lingual Information Retrieval
    Guo, Ping
    Ren, Yubing
    Hu, Yue
    Cao, Yanan
    Li, Yunpeng
    Huang, Heyan
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024: 585-596
  • [36] Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model
    Li, Juntao
    He, Ruidan
    Ye, Hai
    Ng, Hwee Tou
    Bing, Lidong
    Yan, Rui
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020: 3672-3678
  • [37] Large-scale Cross-lingual Language Resources for Referencing and Framing
    Vossen, Piek
    Ilievski, Filip
    Postma, Marten
    Fokkens, Antske
    Minnema, Gosse
    Remijnse, Levi
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020: 3162-3171
  • [38] MindLLM: Lightweight large language model pre-training, evaluation and domain application
    Yang, Yizhe
    Sun, Huashan
    Li, Jiawei
    Liu, Runheng
    Li, Yinghao
    Liu, Yuhang
    Gao, Yang
    Huang, Heyan
    AI OPEN, 2024, 5: 155-180
  • [39] Cross-lingual training of summarization systems using annotated corpora in a foreign language
    Litvak, Marina
    Last, Mark
    INFORMATION RETRIEVAL, 2013, 16 (5): 629-656