On the Impact of Cross-Domain Data on German Language Models

Cited by: 0
Authors
Dada, Amin [1 ]
Chen, Aokun [2 ,3 ]
Peng, Cheng [2 ,3 ]
Smith, Kaleb E. [4 ]
Idrissi-Yaghir, Ahmad [5 ,6 ]
Seibold, Constantin Marc [1 ,7 ]
Li, Jianning [1 ]
Heiliger, Lars [1 ]
Friedrich, Christoph M. [5 ,6 ]
Truhn, Daniel [8 ]
Egger, Jan [1 ,9 ]
Bian, Jiang [2 ,3 ]
Kleesiek, Jens [1 ,9 ,10 ,11 ]
Wu, Yonghui [2 ,3 ]
Affiliations
[1] Univ Hosp Essen AoR, Inst AI Med IKIM, Essen, Germany
[2] Univ Florida, Coll Med, Dept Hlth Outcomes & Biomed Informat, Gainesville, FL USA
[3] Univ Florida, Canc Informat & eHlth Core, Hlth Canc Ctr, Gainesville, FL USA
[4] NVIDIA, Santa Clara, CA USA
[5] Univ Appl Sci & Arts Dortmund, Dept Comp Sci, Dortmund, Germany
[6] Univ Hosp Essen AoR, Inst Med Informat Biometry & Epidemiol IMIBE, Essen, Germany
[7] Univ Hosp Essen AoR, Clin Nucl Med, Essen, Germany
[8] Univ Hosp RWTH Aachen, Dept Diagnost & Intervent Radiol, Aachen, Germany
[9] Univ Hosp Essen AoR, Canc Res Ctr Cologne Essen CCCE, West German Canc Ctr Essen, Essen, Germany
[10] German Canc Consortium DKTK, Partner Site Essen, Heidelberg, Germany
[11] TU Dortmund, Dept Phys, Dortmund, Germany
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art. The models are available at: https://huggingface.co/ikim-uk-essen
Pages: 13801 - 13813
Number of pages: 13
Related papers
50 in total
  • [31] Examining the impact of cross-domain learning on crime prediction
    Bappee, Fateha Khanam
    Soares, Amilcar
    Petry, Lucas May
    Matwin, Stan
    JOURNAL OF BIG DATA, 2021, 8 (01)
  • [32] Towards Robustness of Large Language Models on Text-to-SQL Task: An Adversarial and Cross-Domain Investigation
    Zhang, Weixu
    Wang, Yu
    Fan, Ming
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT V, 2023, 14258 : 181 - 192
  • [33] Cross-Domain Data Traceability Mechanism Based on Blockchain
    Zhao, Shoucai
    Cao, Lifeng
    Li, Jinhui
    Wan, Jiling
    Bai, Jinlong
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 76 (02) : 2531 - 2549
  • [34] Identifying intentions in forum posts with cross-domain data
    Tu Minh Phuong
    Le Cong Linh
    Ngo Xuan Bach
    Journal of Heuristics, 2022, 28 : 171 - 192
  • [35] ADDRESSING UNCERTAINTY AND CONFLICTS IN CROSS-DOMAIN DATA PROVENANCE
    Moitra, Abha
    Barnett, Bruce
    Crapo, Andrew
    Dill, Stephen J.
    MILITARY COMMUNICATIONS CONFERENCE, 2010 (MILCOM 2010), 2010, : 912 - 917
  • [36] Data Loss Prevention for Cross-Domain Instant Messaging
    Kongsgard, Kyrre Wahl
    Nordbotten, Nils Agne
    Mancini, Federico
    Engelstad, Paal E.
    2017 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2017, : 3565 - 3572
  • [37] Data Augmentation for Cross-Domain Named Entity Recognition
    Chen, Shuguang
    Aguilar, Gustavo
    Neves, Leonardo
    Solorio, Thamar
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 5346 - 5356
  • [38] Identifying intentions in forum posts with cross-domain data
    Tu Minh Phuong
    Le Cong Linh
    Ngo Xuan Bach
    JOURNAL OF HEURISTICS, 2022, 28 (02) : 171 - 192
  • [39] A Cross-Domain Comparative Study of Big Data Architectures
    Macak, Martin
    Ge, Mouzhi
    Buhnova, Barbora
    INTERNATIONAL JOURNAL OF COOPERATIVE INFORMATION SYSTEMS, 2020, 29 (04)
  • [40] The Research on Key Techniques of Cross-Domain Data Services
    Yin, Xinming
    Jiang, Haiping
    Huang, Haiye
    Bi, Junhao
    Cao, Zhiwei
    PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON MECHANICAL, ELECTRONIC, CONTROL AND AUTOMATION ENGINEERING (MECAE 2017), 2017, 61 : 398 - 402