An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models

Cited by: 1

Authors:

Ivanisenko, Timofey V. [1,2]

Demenkov, Pavel S. [1,2]

Ivanisenko, Vladimir A. [1,2]

Affiliations:

[1] Novosibirsk State Univ, Artificial Intelligence Res Ctr, Pirogova St 1, Novosibirsk 630090, Russia

[2] Russian Acad Sci, Siberian Branch, Inst Cytol & Genet, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia

Source:

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES | 2024 / Volume 25 / Issue 21

Keywords:

text-mining; ANDSystem; deep learning; GNN; LLM; knowledge graph; FUNCTIONAL MODULES; GENE NETWORKS; RECONSTRUCTION; ASSOCIATION; COMPLEXES; SEROTONIN; BIOLOGY; BINDING; SYSTEMS; SLEEP;

DOI:

10.3390/ijms252111811

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

The rapid growth of biomedical literature makes it challenging for researchers to stay current. Integrating knowledge from various sources is crucial for studying complex biological systems. Traditional text-mining methods often have limited accuracy because they do not capture semantic and contextual nuances. Deep-learning models can be computationally expensive and typically have low interpretability, though efforts in explainable AI aim to mitigate this. Furthermore, transformer-based models tend to produce false or made-up information, a problem known as hallucination, which is especially prevalent in large language models (LLMs). This study proposes a hybrid approach combining text-mining techniques with graph neural networks (GNNs) and fine-tuned LLMs to extend biomedical knowledge graphs and interpret predicted edges based on published literature. An LLM is used to validate predictions and provide explanations. Evaluated on a corpus of experimentally confirmed protein interactions, the approach achieved a Matthews correlation coefficient (MCC) of 0.772. Applied to insomnia, the approach identified 25 interactions between 32 human proteins absent in known knowledge bases, including regulatory interactions between MAOA and 5-HT2C, binding between ADAM22 and 14-3-3 proteins, which is implicated in neurological diseases, and a circadian regulatory loop involving RORB and NR1D1. The hybrid GNN-LLM method efficiently analyzes biomedical literature to uncover potential molecular interactions for complex disorders. It can accelerate therapeutic target discovery by focusing expert verification on the most relevant automatically extracted information.
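
For readers unfamiliar with the metric reported in the abstract, the Matthews correlation coefficient can be computed from the four cells of a binary confusion matrix. The following is a minimal sketch, assuming a binary task (interaction present or absent); the function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for binary confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only (not the paper's data).
print(round(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10), 3))
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when positive and negative examples are imbalanced.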


Pages: 27
