On the Impact of Cross-Domain Data on German Language Models

Cited by: 0
Authors
Dada, Amin [1]
Chen, Aokun [2, 3]
Peng, Cheng [2, 3]
Smith, Kaleb E. [4]
Idrissi-Yaghir, Ahmad [5, 6]
Seibold, Constantin Marc [1, 7]
Li, Jianning [1]
Heiliger, Lars [1]
Friedrich, Christoph M. [5, 6]
Truhn, Daniel [8]
Egger, Jan [1, 9]
Bian, Jiang [2, 3]
Kleesiek, Jens [1, 9, 10, 11]
Wu, Yonghui [2, 3]
Affiliations
[1] Univ Hosp Essen AoR, Inst AI Med IKIM, Essen, Germany
[2] Univ Florida, Coll Med, Dept Hlth Outcomes & Biomed Informat, Gainesville, FL USA
[3] Univ Florida, Canc Informat & eHlth Core, Hlth Canc Ctr, Gainesville, FL USA
[4] NVIDIA, Santa Clara, CA USA
[5] Univ Appl Sci & Arts Dortmund, Dept Comp Sci, Dortmund, Germany
[6] Univ Hosp Essen AoR, Inst Med Informat Biometry & Epidemiol IMIBE, Essen, Germany
[7] Univ Hosp Essen AoR, Clin Nucl Med, Essen, Germany
[8] Univ Hosp RWTH Aachen, Dept Diagnost & Intervent Radiol, Aachen, Germany
[9] Univ Hosp Essen AoR, Canc Res Ctr Cologne Essen CCCE, West German Canc Ctr Essen, Essen, Germany
[10] German Canc Consortium DKTK, Partner Site Essen, Heidelberg, Germany
[11] TU Dortmund, Dept Phys, Dortmund, Germany
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state of the art. The models are available at: https://huggingface.co/ikim-uk-essen
Pages: 13801-13813
Page count: 13