On the Impact of Cross-Domain Data on German Language Models

Cited by: 0
Authors
Dada, Amin [1]
Chen, Aokun [2, 3]
Peng, Cheng [2, 3]
Smith, Kaleb E. [4]
Idrissi-Yaghir, Ahmad [5, 6]
Seibold, Constantin Marc [1, 7]
Li, Jianning [1]
Heiliger, Lars [1]
Friedrich, Christoph M. [5, 6]
Truhn, Daniel [8]
Egger, Jan [1, 9]
Bian, Jiang [2, 3]
Kleesiek, Jens [1, 9, 10, 11]
Wu, Yonghui [2, 3]
Affiliations
[1] Univ Hosp Essen AoR, Inst AI Med IKIM, Essen, Germany
[2] Univ Florida, Coll Med, Dept Hlth Outcomes & Biomed Informat, Gainesville, FL USA
[3] Univ Florida, Canc Informat & eHlth Core, Hlth Canc Ctr, Gainesville, FL USA
[4] NVIDIA, Santa Clara, CA USA
[5] Univ Appl Sci & Arts Dortmund, Dept Comp Sci, Dortmund, Germany
[6] Univ Hosp Essen AoR, Inst Med Informat Biometry & Epidemiol IMIBE, Essen, Germany
[7] Univ Hosp Essen AoR, Clin Nucl Med, Essen, Germany
[8] Univ Hosp RWTH Aachen, Dept Diagnost & Intervent Radiol, Aachen, Germany
[9] Univ Hosp Essen AoR, Canc Res Ctr Cologne Essen CCCE, West German Canc Ctr Essen, Essen, Germany
[10] German Canc Consortium DKTK, Partner Site Essen, Heidelberg, Germany
[11] TU Dortmund, Dept Phys, Dortmund, Germany
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state of the art. The models are available at: https://huggingface.co/ikim-uk-essen
Pages: 13801-13813
Number of pages: 13