Structured information extraction from scientific text with large language models

Cited by: 91
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
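Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;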
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
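As a rough illustration of the "list of JSON objects" output format mentioned in the abstract, the sketch below parses a model completion into records for the dopant/host linking task. The record schema (`host`, `dopants`), the function names, and the example completion string are all hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import json


def parse_records(completion):
    """Parse a completion that is expected to contain a JSON list of objects.

    Returns an empty list when the text is not valid JSON or is not a list,
    so a malformed completion is treated as "no records extracted".
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only JSON objects (Python dicts); skip stray strings or numbers.
    return [item for item in data if isinstance(item, dict)]


def host_dopant_pairs(records):
    """Flatten records into (host, dopant) tuples.

    Assumes the hypothetical schema {"host": str, "dopants": [str, ...]}.
    """
    pairs = []
    for record in records:
        host = record.get("host")
        for dopant in record.get("dopants", []):
            pairs.append((host, dopant))
    return pairs


# Hypothetical completion for a sentence about Al- and Ga-doped ZnO.
example_completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}]'
pairs = host_dopant_pairs(parse_records(example_completion))
print(pairs)
```

Returning an empty list on malformed output is one possible design choice; a real pipeline might instead log or retry such completions.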
Pages: 14
Related papers
50 in total
  • [1] Structured information extraction from scientific text with large language models
    Dagdelen, John
    Dunn, Alexander
    Lee, Sanghoon
    Walker, Nicholas
    Rosen, Andrew S.
    Ceder, Gerbrand
    Persson, Kristin A.
    Jain, Anubhav
    Nature Communications, 15
  • [2] Comparative Analysis of Large Language Models in Structured Information Extraction from Job Postings
    Sioziou, Kyriaki
    Zervas, Panagiotis
    Giotopoulos, Kostas
    Tzimas, Giannis
    ENGINEERING APPLICATIONS OF NEURAL NETWORKS, EANN 2024, 2024, 2141 : 82 - 92
  • [3] Extraction of Subjective Information from Large Language Models
    Kobayashi, Atsuya
    Yamaguchi, Saneyasu
    2024 IEEE 48TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE, COMPSAC 2024, 2024 : 1612 - 1617
  • [4] Large language models recover scientific collaboration networks from text
    Jeyaram, Rathin
    Ward, Robert N.
    Santolini, Marc
    APPLIED NETWORK SCIENCE, 2024, 9 (01)
  • [5] Investigations on Scientific Literature Meta Information Extraction Using Large Language Models
    Guo, Menghao
    Wu, Fan
    Jiang, Jinling
    Yan, Xiaoran
    Chen, Guangyong
    Li, Wenhui
    Zhao, Yunhong
    Sun, Zeyi
    2023 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH, ICKG, 2023 : 249 - 254
  • [6] Scalable information extraction from free text electronic health records using large language models
    Gu, Bowen
    Shao, Vivian
    Liao, Ziqian
    Carducci, Valentina
    Brufau, Santiago Romero
    Yang, Jie
    Desai, Rishi J.
    BMC MEDICAL RESEARCH METHODOLOGY, 2025, 25 (01)
  • [7] From text to insight: large language models for chemical data extraction
    Schilling-Wilhelmi, Mara
    Rios-Garcia, Martino
    Shabih, Sherjeel
    Gil, Maria Victoria
    Miret, Santiago
    Koch, Christoph T.
    Marquez, Jose A.
    Jablonka, Kevin Maik
    CHEMICAL SOCIETY REVIEWS, 2025, 54 (03) : 1125 - 1150
  • [8] Causality Extraction from Medical Text Using Large Language Models (LLMs)
    Gopalakrishnan, Seethalakshmi
    Garbayo, Luciana
    Zadrozny, Wlodek
    INFORMATION, 2025, 16 (01)
  • [9] Toward Reliable Biodiversity Information Extraction From Large Language Models
    Elliott, Michael J.
    Fortes, Jose A. B.
    2024 IEEE 20TH INTERNATIONAL CONFERENCE ON E-SCIENCE, E-SCIENCE 2024, 2024
  • [10] Effective Structured Information Extraction from Chest Radiography Reports Using Open-Weights Large Language Models
    Gee, James C.
    Yao, Michael S.
    RADIOLOGY, 2025, 314 (01)