Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Authors
D. Karpov [1]
M. Burtsev [2]
Affiliations
[1] Moscow Institute of Physics and Technology
[2] London Institute for Mathematical Sciences
DOI
10.1007/s10958-024-07421-5
Abstract
In this work, we investigate knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of data points (361,560 single-label and 170,930 multi-label) with extensive class coverage (76 classes). We prepared the dataset from the "Yandex Que" raw data. By evaluating models trained on RuQTopics on the six matching classes of the Russian MASSIVE subset, we show that RuQTopics is suitable for real-world conversational tasks: Russian-only models trained on it consistently reach an accuracy of about 85% on this subset. We also found that, for multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy correlates closely (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of the BERT pretraining data for the corresponding language. At the same time, the correlation of language-wise accuracy with linguistic distance from Russian is not statistically significant.
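The rank correlation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `average_ranks` and `spearman` are hypothetical, the input numbers are made-up placeholders rather than values from the paper, and the p-value is not computed. The sketch only shows how a Spearman coefficient is obtained as the Pearson correlation of ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder data (NOT from the paper): one entry per hypothetical language.
pretraining_size = [10.0, 200.0, 35.0, 80.0, 5.0]  # assumed relative sizes
accuracy = [0.62, 0.86, 0.70, 0.74, 0.66]          # assumed accuracies

rho = spearman(pretraining_size, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

Because the coefficient depends only on ranks, it measures whether accuracy tends to rise with pretraining data size without assuming a linear relationship between the two.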
Pages: 36-48
Page count: 12