LAPCA: Language-Agnostic Pretraining with Cross-Lingual Alignment

Cited by: 1
Authors
Abulkhanov, Dmitry [1]
Sorokin, Nikita [1]
Nikolenko, Sergey [2, 3]
Malykh, Valentin [1]
Affiliations
[1] Huawei Noah's Ark Lab, Moscow, Russia
[2] RAS, Ivannikov Inst Syst Programming, Moscow, Russia
[3] RAS, Steklov Inst Math, St Petersburg Dept, St Petersburg, Russia
Keywords
cross-lingual IR; multilingual IR; Transformer-based architectures
DOI
10.1145/3539618.3592006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data collection and mining is a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous works used machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train the LAPCA-LM model, based on XLM-RoBERTa and LAPCA, which significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on datasets such as XOR-TyDi and Mr. TyDi. In the zero-shot cross-lingual scenario, it performs on par with supervised methods and outperforms many of them on MKQA.
Pages: 2098-2102
Page count: 5
Related papers
  • [1] Cross-lingual Language Model Pretraining
    Conneau, Alexis
    Lample, Guillaume
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [2] Language-Agnostic Representation from Multilingual Sentence Encoders for Cross-Lingual Similarity Estimation
    Tiyajamorn, Nattapong
    Kajiwara, Tomoyuki
    Arase, Yuki
    Onizuka, Makoto
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 7764 - 7774
  • [3] Cross-lingual Language Model Pretraining for Retrieval
    Yu, Puxuan
    Fei, Hongliang
    Li, Ping
    PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2021 (WWW 2021), 2021, : 1029 - 1039
  • [4] Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation
    Zhou, Yi
    Tian, Xiaohai
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3427 - 3439
  • [5] Structural Contrastive Pretraining for Cross-Lingual Comprehension
    Chen, Nuo
    Shou, Linjun
    Song, Tengtao
    Gong, Ming
    Pei, Jian
    Chang, Jianhui
    Jiang, Daxin
    Li, Jia
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 2042 - 2057
  • [6] Cross-lingual Spoken Language Understanding with Regularized Representation Alignment
    Liu, Zihan
    Winata, Genta Indra
    Xu, Peng
    Lin, Zhaojiang
    Fung, Pascale
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7241 - 7251
  • [7] Cross-lingual Cross-modal Pretraining for Multimodal Retrieval
    Fei, Hongliang
    Yu, Tan
    Li, Ping
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 3644 - 3650
  • [8] END-to-END Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining
    Zhang, Xianwei
    He, Liang
    INTERSPEECH 2021, 2021, : 4728 - 4732
  • [9] Detecting Hate Speech in Cross-Lingual and Multi-lingual Settings Using Language Agnostic Representations
    Rodriguez, Sebastian E.
    Allende-Cid, Hector
    Allende, Hector
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2021, 2021, 12702 : 77 - 87
  • [10] Cross-lingual morphological inflection with explicit alignment
    Coltekin, Cagri
    16TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY (SIGMORPHON 2019), 2019, : 71 - 79