An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study

Authors
Cao, Lang [1]
Sun, Jimeng [1]
Cross, Adam [2]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL USA
[2] Univ Illinois, Coll Med Peoria, Dept Pediat, 1 Illini Dr, Peoria, IL 61605 USA
Keywords
rare disease; clinical informatics; LLM; natural language processing; machine learning; artificial intelligence; large language models; data extraction; ontologies; knowledge graphs; text mining
DOI
10.2196/60665
Chinese Library Classification (CLC) number
R-058
Abstract
Background: Rare diseases affect millions of people worldwide, but individual rare diseases can receive limited research attention because of their low prevalence. Many rare diseases lack specific codes in the International Classification of Diseases, Ninth Edition (ICD-9) and Tenth Edition (ICD-10), and therefore cannot be reliably extracted from granular fields such as "Diagnosis" and "Problem List" entries. This complicates tasks that require identifying patients with these conditions, including clinical trial recruitment and research efforts. Recent advances in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making them unsuitable for these tasks.
Objective: Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease-related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD's performance, aiming to surpass common LLMs and traditional methods.
Methods: AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluated our system's performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology.
Results: On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1% and a relation extraction F1-score of 38.6%, marking a 14.4% improvement over the baseline LLM. Notably, the F1-score for rare disease entity extraction reached 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information.
Conclusions: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts.
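The five stages named in the Methods, and the F1-score used in the Results, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the prompt wording, the `mention|label` reply format, the toy ontology sets, and the canned `fake_llm` stub are all hypothetical stand-ins chosen so the example runs without a model or the real Orphanet and Human Phenotype Ontology data.

```python
from dataclasses import dataclass
from typing import Callable

# Toy stand-ins for ontology lookups (hypothetical; a real system would
# query Orphanet and the Human Phenotype Ontology instead).
RARE_DISEASES = {"fabry disease", "pompe disease"}
SIGNS_SYMPTOMS = {"proteinuria", "muscle weakness"}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


def preprocess(text: str) -> str:
    """Stage 1: normalize case and whitespace."""
    return " ".join(text.lower().split())


def extract_entities(text: str, llm: Callable[[str], str]) -> list[Entity]:
    """Stage 2: parse 'mention|label' lines from the model reply."""
    reply = llm(f"List entities as 'mention|label', one per line:\n{text}")
    entities = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 2:
            entities.append(Entity(parts[0], parts[1]))
    return entities


def extract_relations(text: str, entities: list[Entity],
                      llm: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Stage 3: parse 'head|kind|tail' lines from the model reply."""
    mentions = ", ".join(e.text for e in entities)
    reply = llm(f"Relations as 'head|kind|tail' among {mentions} in:\n{text}")
    relations = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            relations.append((parts[0], parts[1], parts[2]))
    return relations


def calibrate(entities: list[Entity]) -> list[Entity]:
    """Stage 4: overwrite model labels when the ontology knows the term."""
    calibrated = []
    for e in entities:
        if e.text in RARE_DISEASES:
            calibrated.append(Entity(e.text, "rare_disease"))
        elif e.text in SIGNS_SYMPTOMS:
            calibrated.append(Entity(e.text, "sign_symptom"))
        else:
            calibrated.append(e)
    return calibrated


def build_graph(entities: list[Entity],
                relations: list[tuple[str, str, str]]) -> dict:
    """Stage 5: keep only edges whose endpoints are known nodes."""
    nodes = {e.text: e.label for e in entities}
    edges = [r for r in relations if r[0] in nodes and r[2] in nodes]
    return {"nodes": nodes, "edges": edges}


def run_pipeline(text: str, llm: Callable[[str], str]) -> dict:
    clean = preprocess(text)
    entities = extract_entities(clean, llm)
    relations = extract_relations(clean, entities, llm)
    return build_graph(calibrate(entities), relations)


def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact-match items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def fake_llm(prompt: str) -> str:
    """Canned replies standing in for a real model call (assumption)."""
    if prompt.startswith("List entities"):
        return "fabry disease|disease\nproteinuria|sign_symptom"
    return "fabry disease|produces|proteinuria"


graph = run_pipeline("Fabry disease  often produces proteinuria.", fake_llm)
print(graph)
```

In this toy run the stub labels "fabry disease" as a generic disease, and the calibration stage relabels it as a rare disease because the term appears in the toy ontology set. That is one plausible reading of how ontology knowledge could correct model output, offered only as an illustration.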
Pages: 14