Ensemble pretrained language models to extract biomedical knowledge from literature

Cited by: 5
Authors
Li, Zhao [1 ]
Wei, Qiang [1 ]
Huang, Liang-Chin [1 ]
Li, Jianfu [1 ]
Hu, Yan [1 ]
Chuang, Yao-Shun [1 ]
He, Jianping [1 ]
Das, Avisha [1 ]
Keloth, Vipina Kuttichi [2 ]
Yang, Yuntao [1 ]
Diala, Chiamaka S. [1 ]
Roberts, Kirk E. [1 ]
Tao, Cui [1 ]
Jiang, Xiaoqian [1 ]
Zheng, W. Jim [1 ]
Xu, Hua [2 ,3 ]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, McWilliams Sch Biomed Informat, Houston, TX 77030 USA
[2] Yale Univ, Sch Med, Sect Biomed Informat & Data Sci, New Haven, CT 06510 USA
[3] Yale Univ, Sch Med, Sect Biomed Informat & Data Sci, 100 Coll St, New Haven, CT 06510 USA
Funding
US National Institutes of Health;
Keywords
named entity recognition; relation extraction; large language model; ensemble learning; knowledge base; RECOGNITION; NAME;
DOI
10.1093/jamia/ocae061
Chinese Library Classification (CLC)
TP [automation technology; computer technology];
Discipline code
0812;
Abstract
Objectives: The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research gaps. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Sciences, aims to evaluate this potential and provides a manually annotated corpus for method development and benchmarking.
Materials and Methods: For the named entity recognition (NER) task, we used ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA, devised a rule-based method for detecting cell line and taxonomy names, and annotated 70 additional abstracts as a supplementary corpus. We further fine-tuned the T0pp model, with 11 billion parameters, to boost performance on relation extraction (RE), and leveraged entities' location information (eg, title, background) to improve novelty prediction in RE.
Results: Our NLP system, designed for this challenge, secured first place in Phase I (NER) and second place in Phase II (relation extraction and novelty prediction), outpacing over 200 teams. We also tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a zero-shot setting on the same test set, finding that our fine-tuned models considerably outperform these general-purpose large language models.
Discussion and Conclusion: Our results demonstrate a robust NLP system that excels at NER and RE across various biomedical entities, underscoring that task-specific models still outperform generic large ones. Such insights are valuable for efforts such as knowledge graph construction and hypothesis generation in biomedical research.
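To make the ensembling step concrete, the following is a minimal Python sketch of merging token-level NER predictions from several models by per-token majority vote. It assumes all models emit BIO tags over the same tokenization; the function name, tag set, and tie-breaking rule are illustrative assumptions, not the authors' exact implementation.

```python
from collections import Counter

def ensemble_bio_tags(predictions):
    """Merge aligned BIO tag sequences from multiple NER models by
    per-token majority vote. Illustrative sketch only: the paper's
    exact merging strategy is not specified in the abstract."""
    merged = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        # Majority vote; ties are broken in favor of an entity tag over "O".
        best = max(counts.items(), key=lambda kv: (kv[1], kv[0] != "O"))[0]
        merged.append(best)
    return merged

# Example: three hypothetical model outputs (eg, BioBERT, PubMedBERT,
# BioM-ELECTRA) over the same five tokens.
m1 = ["B-Gene", "I-Gene", "O", "O",         "B-Disease"]
m2 = ["B-Gene", "O",      "O", "O",         "B-Disease"]
m3 = ["B-Gene", "I-Gene", "O", "B-Disease", "B-Disease"]

print(ensemble_bio_tags([m1, m2, m3]))
# ['B-Gene', 'I-Gene', 'O', 'O', 'B-Disease']
```

In practice, span-level voting or confidence-weighted merging are common variants; the abstract does not state which strategy the winning system used.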
Pages: 1904-1911
Page count: 8