An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study

Cited by: 0

Authors:

Cao, Lang [1]

Sun, Jimeng [1]

Cross, Adam [2]

Affiliations:

[1] Univ Illinois, Dept Comp Sci, Urbana, IL USA

[2] Univ Illinois, Coll Med Peoria, Dept Pediat, 1 Illini Dr, Peoria, IL 61605 USA

Source:

JMIR MEDICAL INFORMATICS | 2024 / Vol. 12

Keywords:

rare disease; clinical informatics; LLM; natural language processing; machine learning; artificial intelligence; large language models; data extraction; ontologies; knowledge graphs; text mining

DOI:

10.2196/60665

Chinese Library Classification (CLC) number:

R-058


Abstract:

Background: Rare diseases affect millions of people worldwide, but individually they sometimes receive limited research focus because of their low prevalence. Many rare diseases do not have specific International Classification of Diseases, Ninth Edition (ICD-9) and Tenth Edition (ICD-10) codes and therefore cannot be reliably extracted from granular fields like "Diagnosis" and "Problem List" entries. This complicates tasks that require identifying patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making them unsuitable for these tasks.

Objective: Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease-related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD's performance, aiming to surpass common LLMs and traditional methods.

Methods: AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluated our system's performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology.

Results: On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1% and a relation extraction F1-score of 38.6%, marking a 14.4% improvement over the baseline LLM. Notably, the F1-score for rare disease entity extraction reached 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information.

Conclusions: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts.
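
For readers unfamiliar with the metric reported in the abstract, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for entity extraction. It assumes exact matching on (mention text, entity type) pairs, which may differ from the matching criterion actually used in the study. The function name, the entity type labels, and the example entities are all hypothetical and are not taken from AutoRD or from RareDis2023.

```python
from typing import Set, Tuple

# An entity is modeled as (mention text, entity type). This exact-match
# representation is an assumption made for illustration only.
Entity = Tuple[str, str]


def precision_recall_f1(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted entities against gold entities."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: one prediction matches exactly, one has the wrong type,
# and one gold entity is missed entirely.
gold = {
    ("Marfan syndrome", "rare_disease"),
    ("aortic dilation", "sign"),
    ("tall stature", "sign"),
}
predicted = {
    ("Marfan syndrome", "rare_disease"),
    ("aortic dilation", "symptom"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.500 recall=0.333 f1=0.400
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which is why a high F1-score is taken as evidence of both.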


Pages: 14
