Ensemble pretrained language models to extract biomedical knowledge from literature

Cited by: 5
Authors
Li, Zhao [1 ]
Wei, Qiang [1 ]
Huang, Liang-Chin [1 ]
Li, Jianfu [1 ]
Hu, Yan [1 ]
Chuang, Yao-Shun [1 ]
He, Jianping [1 ]
Das, Avisha [1 ]
Keloth, Vipina Kuttichi [2 ]
Yang, Yuntao [1 ]
Diala, Chiamaka S. [1 ]
Roberts, Kirk E. [1 ]
Tao, Cui [1 ]
Jiang, Xiaoqian [1 ]
Zheng, W. Jim [1 ]
Xu, Hua [2 ,3 ]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, McWilliams Sch Biomed Informat, Houston, TX 77030 USA
[2] Yale Univ, Sch Med, Sect Biomed Informat & Data Sci, New Haven, CT 06510 USA
[3] Yale Univ, Sch Med, Sect Biomed Informat & Data Sci, 100 Coll St, New Haven, CT 06510 USA
Funding
National Institutes of Health (NIH);
Keywords
named entity recognition; relation extraction; large language model; ensemble learning; knowledge base; RECOGNITION; NAME;
DOI
10.1093/jamia/ocae061
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objectives: The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research gaps. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Sciences, aims to evaluate this potential and provides a manually annotated corpus for methodology development and benchmarking.

Materials and Methods: For the named entity recognition (NER) task, we used ensemble learning to merge predictions from three domain-specific models (BioBERT, PubMedBERT, and BioM-ELECTRA), devised a rule-based detection method for cell line and taxonomy names, and annotated 70 additional abstracts as a supplementary corpus. We further fine-tuned the 11-billion-parameter T0pp model to boost relation extraction (RE) performance and leveraged entities' location information (e.g., title, background) to improve novelty prediction in RE.

Results: The NLP system we designed for this challenge secured first place in Phase I (NER) and second place in Phase II (relation extraction and novelty prediction), outperforming over 200 teams. We also tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a zero-shot setting on the same test set, finding that our fine-tuned model considerably surpasses these general-purpose large language models.

Discussion and Conclusion: Our results demonstrate a robust NLP system that excels at NER and RE across various biomedical entities and underscore that task-specific models remain superior to generic large ones. These insights are valuable for endeavors such as knowledge graph development and hypothesis generation in biomedical research.
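As a minimal illustration of the kind of ensemble step described in the Materials and Methods (merging NER predictions from several domain-specific models), the sketch below shows token-level majority voting over aligned label sequences. The record does not specify the exact merging scheme the authors used, so the function name, label sets, and example model outputs here are hypothetical.

```python
from collections import Counter

def ensemble_ner_labels(predictions):
    """Merge token-level NER label sequences from several models by majority vote.

    `predictions` is a list of label sequences (one per model), all aligned to
    the same tokens. Ties are broken in favor of the model listed first.
    This is an illustrative sketch, not the authors' published method.
    """
    merged = []
    for token_labels in zip(*predictions):
        label, _count = Counter(token_labels).most_common(1)[0]
        merged.append(label)
    return merged

# Hypothetical outputs from three domain-specific NER models (illustrative only).
biobert_out    = ["B-Gene", "I-Gene", "O", "B-Disease"]
pubmedbert_out = ["B-Gene", "I-Gene", "O", "O"]
bioelectra_out = ["B-Gene", "O",      "O", "B-Disease"]

print(ensemble_ner_labels([biobert_out, pubmedbert_out, bioelectra_out]))
# ['B-Gene', 'I-Gene', 'O', 'B-Disease']
```

A span-level vote (agreeing on whole entity mentions rather than individual tokens) is a common alternative; either way, rule-based detectors for entity types such as cell lines and taxonomy names can simply overwrite or supplement the voted labels.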
Pages: 1904-1911
Page count: 8