Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Authors
D. Karpov [1]
M. Burtsev [2]
Affiliations
[1] Moscow Institute of Physics and Technology
[2] London Institute for Mathematical Sciences
DOI
10.1007/s10958-024-07421-5
Abstract
In this work, we investigate knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of data points (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the “Yandex Que” raw data. By evaluating models trained on RuQTopics on the six matching classes from the Russian MASSIVE subset, we show that the RuQTopics dataset is suitable for real-world conversational tasks, as Russian-only models trained on this dataset consistently yield an accuracy of around 85% on this subset. We have also found that for the multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of the BERT pretraining data for the corresponding language. At the same time, the correlation of language-wise accuracy with the linguistic distance from the Russian language is not statistically significant.
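The rank correlation reported in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`average_ranks`, `pearson`, `spearman`) are hypothetical, and the five data points are made-up placeholders, not values from the paper. The sketch computes Spearman's coefficient as the Pearson correlation of the ranks, with ties given the average of their positions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder data: one entry per language.
pretraining_size = [100.0, 40.0, 12.0, 5.0, 1.5]  # illustrative units only
accuracy = [0.86, 0.80, 0.82, 0.70, 0.61]         # illustrative accuracies

rho = spearman(pretraining_size, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

To obtain a p-value alongside the coefficient, a library routine such as `scipy.stats.spearmanr` could be used instead; whether the authors used it is not stated in this record.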
Pages: 36-48
Number of pages: 12