An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models

Cited by: 1
Authors
Ivanisenko, Timofey V. [1, 2]
Demenkov, Pavel S. [1, 2]
Ivanisenko, Vladimir A. [1, 2]
Affiliations
[1] Novosibirsk State Univ, Artificial Intelligence Res Ctr, Pirogova St 1, Novosibirsk 630090, Russia
[2] Russian Acad Sci, Siberian Branch, Inst Cytol & Genet, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
Keywords
text-mining; ANDSystem; deep learning; GNN; LLM; knowledge graph; FUNCTIONAL MODULES; GENE NETWORKS; RECONSTRUCTION; ASSOCIATION; COMPLEXES; SEROTONIN; BIOLOGY; BINDING; SYSTEMS; SLEEP;
DOI
10.3390/ijms252111811
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
The rapid growth of biomedical literature makes it challenging for researchers to stay current. Integrating knowledge from various sources is crucial for studying complex biological systems. Traditional text-mining methods often have limited accuracy because they do not capture semantic and contextual nuances. Deep-learning models can be computationally expensive and typically have low interpretability, though efforts in explainable AI aim to mitigate this. Furthermore, transformer-based models tend to produce false or made-up information, a problem known as hallucination that is especially prevalent in large language models (LLMs). This study proposes a hybrid approach combining text-mining techniques with graph neural networks (GNNs) and fine-tuned LLMs to extend biomedical knowledge graphs and interpret predicted edges based on published literature. An LLM is used to validate predictions and provide explanations. Evaluated on a corpus of experimentally confirmed protein interactions, the approach achieved a Matthews correlation coefficient (MCC) of 0.772. Applied to insomnia, the approach identified 25 interactions between 32 human proteins absent from known knowledge bases, including regulatory interactions between MAOA and 5-HT2C, binding between ADAM22 and 14-3-3 proteins, which is implicated in neurological diseases, and a circadian regulatory loop involving RORB and NR1D1. The hybrid GNN-LLM method efficiently analyzes biomedical literature to uncover potential molecular interactions for complex disorders. It can accelerate therapeutic target discovery by focusing expert verification on the most relevant automatically extracted information.
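For readers unfamiliar with the evaluation metric named in the abstract, the following is a minimal sketch of how a Matthews correlation coefficient is computed from a binary confusion matrix. The function name `mcc` and the example counts are hypothetical illustrations; they are not taken from the paper and do not reproduce its reported value of 0.772.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; by convention 0.0 is returned when any
    marginal sum is zero and the coefficient is otherwise undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not data from the study).
    score = mcc(tp=90, tn=85, fp=15, fn=10)
    print(f"MCC = {score:.3f}")
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when interacting and non-interacting pairs are imbalanced.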
Pages: 27
Related papers
50 in total
  • [31] A Systematic Approach to Prompting Large Language Models for Automated Feature Extraction from Cardiovascular Imaging Reports
    Goldfinger, Shir
    Mackay, Emily
    Chan, Trevor
    Eswar, Vikram
    Grasfield, Rachel
    Yan, Vivian
    Barreto, David
    Pouch, Alison
    CIRCULATION, 2024, 150
  • [32] Reliability of large language models as a tool for knowledge extraction from biographical dictionaries: the case of the Polish Biographical Dictionary
    Jaskulski, Piotr
    Latos, Tomasz
    Rynca, Mariusz
    Zapala, Adam
    DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2025
  • [33] Extracting chemical food safety hazards from the scientific literature automatically using large language models
    Ozen, Neris
    Mu, Wenjuan
    Asselt, Esther D. van
    van den Bulk, Leonieke M.
    APPLIED FOOD RESEARCH, 2025, 5 (01)
  • [34] An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study
    Cao, Lang
    Sun, Jimeng
    Cross, Adam
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [35] Leveraging protein language models and graph convolutional neural networks for accurate prediction of ligand bioactivity in class A G protein-coupled receptors
    Provasi, Davide
    Riina, Nicholas
    Cullen, Olivia
    Filizola, Marta
    BIOPHYSICAL JOURNAL, 2024, 123 (03): 428A
  • [36] Scalable information extraction from free text electronic health records using large language models
    Gu, Bowen
    Shao, Vivian
    Liao, Ziqian
    Carducci, Valentina
    Brufau, Santiago Romero
    Yang, Jie
    Desai, Rishi J.
    BMC MEDICAL RESEARCH METHODOLOGY, 2025, 25 (01)
  • [37] Advancing language models through domain knowledge integration: a comprehensive approach to training, evaluation, and optimization of social scientific neural word embeddings
    Stoehr, Fabian
    JOURNAL OF COMPUTATIONAL SOCIAL SCIENCE, 2024, 7 (02): 1753-1793
  • [38] OmEGa(Ω): Ontology-based information extraction framework for constructing task-centric knowledge graph from manufacturing documents with large language model
    Shim, Midan
    Choi, Hyojun
    Koo, Heeyeon
    Um, Kaehyun
    Lee, Kyong-Ho
    Lee, Sanghyun
    ADVANCED ENGINEERING INFORMATICS, 2025, 64
  • [39] Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing
    Geevarghese, Ruben
    Sigel, Carlie
    Cadley, John
    Chatterjee, Subrata
    Jain, Pulkit
    Hollingsworth, Alex
    Chatterjee, Avijit
    Swinburne, Nathaniel
    Bilal, Khawaja Hasan
    Marinelli, Brett
    JOURNAL OF CLINICAL PATHOLOGY, 2024
  • [40] AUTOMATED EXTRACTION OF COST-EFFECTIVENESS MODELS DATA FROM HEALTH TECHNOLOGY ASSESSMENT SUBMISSIONS USING LARGE-LANGUAGE MODELS (LLMS): DOES THE PROMPTING APPROACH MATTER?
    Szabo, G.
    Pinsent, A.
    Slim, M.
    Sullivan, S.
    Benedict, A.
    Rivolo, S.
    VALUE IN HEALTH, 2024, 27 (12)