Ensemble pretrained language models to extract biomedical knowledge from literature

Times Cited: 5
Authors
Li, Zhao [1]
Wei, Qiang [1]
Huang, Liang-Chin [1]
Li, Jianfu [1]
Hu, Yan [1]
Chuang, Yao-Shun [1]
He, Jianping [1]
Das, Avisha [1]
Keloth, Vipina Kuttichi [2]
Yang, Yuntao [1]
Diala, Chiamaka S. [1]
Roberts, Kirk E. [1]
Tao, Cui [1]
Jiang, Xiaoqian [1]
Zheng, W. Jim [1]
Xu, Hua [2,3]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, McWilliams Sch Biomed Informat, Houston, TX 77030 USA
[2] Yale Univ, Sch Med, Sect Biomed Informat & Data Sci, New Haven, CT 06510 USA
[3] Yale Univ, Sch Med, Sect Biomed Informat & Data Sci, 100 Coll St, New Haven, CT 06510 USA
Funding
National Institutes of Health (NIH);
Keywords
named entity recognition; relation extraction; large language model; ensemble learning; knowledge base; RECOGNITION; NAME;
DOI
10.1093/jamia/ocae061
Chinese Library Classification (CLC) Code
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Objectives: The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research gaps. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate this potential and provides a manually annotated corpus for methodology development and benchmarking.

Materials and Methods: For the named entity recognition (NER) task, we used ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA; devised a rule-driven detection method for cell line and taxonomy names; and annotated 70 additional abstracts to extend the corpus. We further fine-tuned the T0pp model, with 11 billion parameters, to boost performance on relation extraction (RE), and leveraged entities' location information (e.g., title, background) to enhance novelty prediction in RE.

Results: Our NLP system designed for this challenge secured first place in Phase I (NER) and second place in Phase II (relation extraction and novelty prediction), outpacing over 200 teams. We also tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a zero-shot setting on the same test set, showing that our fine-tuned model considerably surpasses these general-purpose large language models.

Discussion and Conclusion: Our results demonstrate a robust NLP system that excels in NER and RE across various biomedical entities and underscore that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors such as knowledge graph development and hypothesis generation in biomedical research.
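The Materials and Methods summary mentions ensemble learning that merges NER predictions from three domain-specific models. As a rough illustration of that idea only (a minimal sketch, not the authors' implementation), the Python snippet below merges aligned token-level BIO tag sequences by majority vote; the function name ensemble_bio_tags and the toy tag sequences are illustrative assumptions.

from collections import Counter

def ensemble_bio_tags(predictions):
    """Majority-vote merge of aligned token-level BIO tag sequences.

    `predictions` holds one BIO tag sequence per model, all aligned to the
    same tokenization; ties fall back to the first model's tag.
    """
    merged = []
    for position_tags in zip(*predictions):
        top_tag, top_count = Counter(position_tags).most_common(1)[0]
        # Accept a tag only if it wins a strict majority; otherwise defer
        # to the first (reference) model's prediction.
        merged.append(top_tag if top_count > len(position_tags) // 2 else position_tags[0])
    return merged

# Toy example: hypothetical tags for the tokens "BRCA1 mutations cause cancer"
# as they might come from three domain-specific models.
biobert_tags    = ["B-GeneOrGeneProduct", "O", "O", "B-DiseaseOrPhenotypicFeature"]
pubmedbert_tags = ["B-GeneOrGeneProduct", "O", "O", "O"]
bioelectra_tags = ["B-GeneOrGeneProduct", "O", "O", "B-DiseaseOrPhenotypicFeature"]

print(ensemble_bio_tags([biobert_tags, pubmedbert_tags, bioelectra_tags]))
# -> ['B-GeneOrGeneProduct', 'O', 'O', 'B-DiseaseOrPhenotypicFeature']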
Pages: 1904-1911
Number of Pages: 8
Related Papers
50 records in total
  • [1] EPICURE: Ensemble Pretrained Models for Extracting Cancer Mutations from Literature
    Cao, Jiarun
    van Veen, Elke M.
    Peek, Niels
    Renehan, Andrew G.
    Ananiadou, Sophia
    2021 IEEE 34TH INTERNATIONAL SYMPOSIUM ON COMPUTER-BASED MEDICAL SYSTEMS (CBMS), 2021, : 461 - 467
  • [2] Developing Pretrained Language Models for Turkish Biomedical Domain
    Turkmen, Hazal
    Dikenelli, Oguz
    Eraslan, Cenk
    Calli, Mehmet Cem
    Ozbek, Suha Sureyya
    2022 IEEE 10TH INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI 2022), 2022, : 597 - 598
  • [3] Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints
    Song, Ran
    He, Shizhu
    Gao, Shengxiang
    Cai, Li
    Liu, Kang
    Yu, Zhengtao
    Zhao, Jun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 7709 - 7721
  • [4] Eliciting Knowledge from Pretrained Language Models for Prototypical Prompt Verbalizer
    Wei, Yinyi
    Mo, Tong
    Jiang, Yongtao
    Li, Weiping
    Zhao, Wen
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT II, 2022, 13530 : 222 - 233
  • [5] KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model
    Geng, Lei
    Yan, Xu
    Cao, Ziqiang
    Li, Juntao
    Li, Wenjie
    Li, Sujian
    Zhou, Xinjie
    Yang, Yang
    Zhang, Jun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 11239 - 11250
  • [6] BertNet: Harvesting Knowledge Graphs with Arbitrary Relations from Pretrained Language Models
    Hao, Shibo
    Tan, Bowen
    Tang, Kaiwen
    Ni, Bin
    Shao, Xiyan
    Zhang, Hengzhe
    Xing, Eric P.
    Hu, Zhiting
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 5000 - 5015
  • [7] Pretrain-KGE: Learning Knowledge Representation from Pretrained Language Models
    Zhang, Zhiyuan
    Liu, Xiaoqian
    Zhang, Yi
    Su, Qi
    Sun, Xu
    He, Bin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 259 - 266
  • [8] AMMU: A survey of transformer-based biomedical pretrained language models
    Kalyan, Katikapalli Subramanyam
    Rajasekharan, Ajit
    Sangeetha, Sivanesan
    JOURNAL OF BIOMEDICAL INFORMATICS, 2022, 126
  • [9] Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models
    Kassner, Nora
    Dufter, Philipp
    Schutze, Hinrich
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 3250 - 3258
  • [10] Constructing Taxonomies from Pretrained Language Models
    Chen, Catherine
    Lin, Kevin
    Klein, Dan
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 4687 - 4700