Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Cited by: 0
Authors
D. Karpov [1 ]
M. Burtsev [2 ]
Affiliations
[1] Moscow Institute of Physics and Technology
[2] London Institute for Mathematical Sciences
DOI
10.1007/s10958-024-07421-5
Abstract
In this work, we investigate knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of data points (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We prepared the dataset from the raw "Yandex Que" data. By evaluating models trained on RuQTopics on the six matching classes of the Russian MASSIVE subset, we show that RuQTopics is suitable for real-world conversational tasks: Russian-only models trained on it consistently reach an accuracy of about 85% on this subset. We also found that for multilingual BERT trained on RuQTopics and evaluated on the same six MASSIVE classes across all MASSIVE languages, the language-wise accuracy correlates closely (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of the BERT pretraining data for the corresponding language. In contrast, the correlation of language-wise accuracy with linguistic distance from Russian is not statistically significant.
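The language-wise correlation analysis described in the abstract can, in principle, be reproduced with a few lines of Python using scipy.stats.spearmanr. The sketch below is illustrative only: the per-language accuracies and pretraining-data sizes are hypothetical placeholders, not values reported in the paper.

```python
# Minimal sketch of a language-wise Spearman correlation analysis.
# All numbers below are illustrative placeholders, NOT the paper's data.
from scipy.stats import spearmanr

# Hypothetical per-language accuracy on the six matching MASSIVE classes
accuracy = {"ru": 0.85, "en": 0.83, "de": 0.80, "sw": 0.55}
# Hypothetical approximate mBERT pretraining-data size per language (GB)
pretrain_size_gb = {"ru": 30.0, "en": 16.0, "de": 12.0, "sw": 0.3}

langs = sorted(accuracy)
rho, p_value = spearmanr(
    [accuracy[lang] for lang in langs],
    [pretrain_size_gb[lang] for lang in langs],
)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```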
Pages: 36 – 48
Number of pages: 12
Related Papers
50 records in total
  • [1] Cross-lingual Transfer of Monolingual Models
    Gogoulou, Evangelia
    Ekgren, Ariel
    Isbister, Tim
    Sahlgren, Magnus
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 948 - 955
  • [2] Monolingual, multilingual and cross-lingual code comment classification
    Kostic, Marija
    Batanovic, Vuk
    Nikolic, Bosko
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 124
  • [3] Can Monolingual Pretrained Models Help Cross-Lingual Classification?
    Chi, Zewen
    Dong, Li
    Wei, Furu
    Mao, Xian-Ling
    Huang, Heyan
    1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 12 - 17
  • [4] Contextualized Embeddings Encode Monolingual and Cross-lingual Knowledge of Idiomaticity
    Fakharian, Samin
    Cook, Paul
    MWE 2021: THE 17TH WORKSHOP ON MULTIWORD EXPRESSIONS, 2021, : 23 - 32
  • [5] Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification
    Xenouleas, Stratos
    Tsoukara, Alexia
    Panagiotakis, Giannis
    Chalkidis, Ilias
    Androutsopoulos, Ion
    PROCEEDINGS OF THE 12TH HELLENIC CONFERENCE ON ARTIFICIAL INTELLIGENCE, SETN 2022, 2022,
  • [6] On the Cross-lingual Transferability of Monolingual Representations
    Artetxe, Mikel
    Ruder, Sebastian
    Yogatama, Dani
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 4623 - 4637
  • [7] Cross-Lingual Knowledge Transfer for Clinical Phenotyping
    Papaioannou, Jens-Michalis
    Grundmann, Paul
    van Aken, Betty
    Samaras, Athanasios
    Kyparissidis, Ilias
    Giannakoulas, George
    Gers, Felix
    Loeser, Alexander
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 900 - 909
  • [8] An Unsupervised Cross-Lingual Topic Model Framework for Sentiment Classification
    Lin, Zheng
    Jin, Xiaolong
    Xu, Xueke
    Wang, Yuanzhuo
    Cheng, Xueqi
    Wang, Weiping
    Meng, Dan
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (03) : 432 - 444
  • [9] BERT for Monolingual and Cross-Lingual Reverse Dictionary
    Yan, Hang
    Li, Xiaonan
    Qiu, Xipeng
    Deng, Bocao
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 4329 - 4338
  • [10] Cross-Lingual Latent Topic Extraction
    Zhang, Duo
    Mei, Qiaozhu
    Zhai, ChengXiang
    ACL 2010: 48TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2010, : 1128 - 1137