On the Impact of Cross-Domain Data on German Language Models

Cited by: 0
Authors
Dada, Amin [1]
Chen, Aokun [2, 3]
Peng, Cheng [2, 3]
Smith, Kaleb E. [4]
Idrissi-Yaghir, Ahmad [5, 6]
Seibold, Constantin Marc [1, 7]
Li, Jianning [1]
Heiliger, Lars [1]
Friedrich, Christoph M. [5, 6]
Truhn, Daniel [8]
Egger, Jan [1, 9]
Bian, Jiang [2, 3]
Kleesiek, Jens [1, 9, 10, 11]
Wu, Yonghui [2, 3]
Affiliations
[1] Univ Hosp Essen AoR, Inst AI Med IKIM, Essen, Germany
[2] Univ Florida, Coll Med, Dept Hlth Outcomes & Biomed Informat, Gainesville, FL USA
[3] Univ Florida, Canc Informat & eHlth Core, Hlth Canc Ctr, Gainesville, FL USA
[4] NVIDIA, Santa Clara, CA USA
[5] Univ Appl Sci & Arts Dortmund, Dept Comp Sci, Dortmund, Germany
[6] Univ Hosp Essen AoR, Inst Med Informat Biometry & Epidemiol IMIBE, Essen, Germany
[7] Univ Hosp Essen AoR, Clin Nucl Med, Essen, Germany
[8] Univ Hosp RWTH Aachen, Dept Diagnost & Intervent Radiol, Aachen, Germany
[9] Univ Hosp Essen AoR, Canc Res Ctr Cologne Essen CCCE, West German Canc Ctr Essen, Essen, Germany
[10] German Canc Consortium DKTK, Partner Site Essen, Heidelberg, Germany
[11] TU Dortmund, Dept Phys, Dortmund, Germany
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state of the art. The models are available at: https://huggingface.co/ikim-uk-essen
Pages: 13801-13813
Number of pages: 13