Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Authors
D. Karpov [1]
M. Burtsev [2]
Affiliations
[1] Moscow Institute of Physics and Technology
[2] London Institute for Mathematical Sciences
DOI
10.1007/s10958-024-07421-5
Abstract
In this work, we investigate knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of data points (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We prepared this dataset from the “Yandex Que” raw data. By evaluating models trained on RuQTopics on the six matching classes from the Russian MASSIVE subset, we show that the RuQTopics dataset is suitable for real-world conversational tasks: Russian-only models trained on this dataset consistently yield an accuracy of around 85% on this subset. We also found that, for multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of the BERT pretraining data for the corresponding language. At the same time, the correlation of language-wise accuracy with linguistic distance from Russian is not statistically significant.
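The Spearman correlation mentioned in the abstract is a rank correlation: both variables are replaced by their ranks, and the Pearson correlation of those ranks is reported. The following is a minimal sketch of that computation, not the authors' code. The function names and the five data points are hypothetical and are not taken from the paper. The p-value is not computed here.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical illustration only: one entry per language.
pretrain_size = [120.0, 45.0, 8.0, 60.0, 3.0]  # assumed pretraining data sizes
accuracy = [0.84, 0.71, 0.55, 0.78, 0.58]      # assumed per-language accuracy

rho = spearman(pretrain_size, accuracy)
print(round(rho, 3))
```

Because only ranks enter the computation, the result does not depend on the units or scale used for the pretraining data size, which suits an "approximate size" measure.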
Pages: 36-48
Page count: 12
Related papers
50 in total
  • [21] Cross-lingual distillation for domain knowledge transfer with sentence transformers
    Piperno, Ruben
    Bacco, Luca
    Dell'Orletta, Felice
    Merone, Mario
    Pecchia, Leandro
    KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [22] Coarse Alignment of Topic and Sentiment: A Unified Model for Cross-Lingual Sentiment Classification
    Wang, Deqing
    Jing, Baoyu
    Lu, Chenwei
    Wu, Junjie
    Liu, Guannan
    Du, Chenguang
    Zhuang, Fuzhen
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (02) : 736 - 747
  • [23] Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus
    Trotta, Daniela
    Guarasci, Raffaele
    Leonardelli, Elisa
    Tonelli, Sara
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 2929 - 2940
  • [24] Cross-lingual Evidence Improves Monolingual Fake News Detection
    Dementieva, Daryna
    Panchenko, Alexander
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2021, : 310 - 320
  • [25] A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets
    Camacho-Collados, Jose
    Pilehvar, Mohammad Taher
    Navigli, Roberto
    PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, 2015, : 1 - 7
  • [26] GreenPLM: Cross-Lingual Transfer of Monolingual Pre-Trained Language Models at Almost No Cost
    Zeng, Qingcheng
    Garay, Lucas
    Zhou, Peilin
    Chong, Dading
    Hua, Yining
    Wu, Jiageng
    Pan, Yikang
    Zhou, Han
    Voigt, Rob
    Yang, Jie
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 6290 - 6298
  • [27] Conversations Powered by Cross-Lingual Knowledge
    Sun, Weiwei
    Meng, Chuan
    Meng, Qi
    Ren, Zhaochun
    Ren, Pengjie
    Chen, Zhumin
    de Rijke, Maarten
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1442 - 1451
  • [28] Cross-Lingual Classification of Crisis Data
    Khare, Prashant
    Burel, Gregoire
    Maynard, Diana
    Alani, Harith
    SEMANTIC WEB - ISWC 2018, PT I, 2018, 11136 : 617 - 633
  • [29] Cross-Lingual Web Spam Classification
    Garzo, Andras
    Daroczy, Balint
    Kiss, Tamas
    Siklosi, David
    Benczur, Andras A.
    PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'13 COMPANION), 2013, : 1149 - 1156
  • [30] Cross-lingual Distillation for Text Classification
    Xu, Ruochen
    Yang, Yiming
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 1415 - 1425