Structured information extraction from scientific text with large language models

Cited by: 91
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
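Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract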
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or in a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured, specialized scientific knowledge extracted from research papers.
Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
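The "list of JSON objects" output format mentioned in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the field names `host` and `dopants`, the example sentence, and the helper functions `make_training_example` and `parse_completion` are hypothetical and are not taken from the paper. The sketch only shows how a passage might be paired with a JSON target for fine-tuning, and how a model completion might be parsed back into records.

```python
import json

# Hypothetical schema: "host" and "dopants" are illustrative field names,
# not the schema used by the authors.
REQUIRED_KEYS = {"host", "dopants"}


def make_training_example(passage, records):
    """Pair an input passage with its target output, serialized as JSON."""
    return {"prompt": passage, "completion": json.dumps(records)}


def parse_completion(completion):
    """Parse a model completion into a list of records.

    Returns an empty list if the text is not valid JSON or is not a list;
    records missing a required key are dropped.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        record for record in data
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record)
    ]


# Illustrative sentence and annotation (made up for this sketch).
example = make_training_example(
    "ZnO films were doped with Al and Ga.",
    [{"host": "ZnO", "dopants": ["Al", "Ga"]}],
)
records = parse_completion(example["completion"])
```

Validating each completion before storing it is one plausible design choice: a language model can return malformed or incomplete JSON, so a parser that rejects such output keeps bad records out of the resulting database.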
Pages: 14
Related papers
50 in total
  • [21] Information extraction using the structured language model
    Chelba, C
    Mahajan, M
    PROCEEDINGS OF THE 2001 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 2001, : 74 - 81
  • [22] Privacy-preserving large language models for structured medical information retrieval
    Wiest, Isabella Catharina
    Ferber, Dyke
    Zhu, Jiefu
    van Treeck, Marko
    Meyer, Sonja K.
    Juglan, Radhika
    Carrero, Zunamys I.
    Paech, Daniel
    Kleesiek, Jens
    Ebert, Matthias P.
    Truhn, Daniel
    Kather, Jakob Nikolas
    NPJ DIGITAL MEDICINE, 2024, 7 (01):
  • [23] A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text Using Large Language Models
    Neuberger, Julian
    Ackermann, Lars
    van der Aa, Han
    Jablonski, Stefan
    CONCEPTUAL MODELING, ER 2024, 2025, 15238 : 38 - 55
  • [24] Exploring Large Language Models for Low-Resource IT Information Extraction
    Bhavya, Bhavya
    Isaza, Paulina Toro
    Deng, Yu
    Nidd, Michael
    Azad, Amar Prakash
    Shwartz, Larisa
    Zhai, ChengXiang
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1203 - 1212
  • [25] Large Language Models for Conducting Advanced Text Analytics Information Systems Research
    Ampel, Benjamin
    Yang, Chi-heng
    Hu, James
    Chen, Hsinchun
    ACM TRANSACTIONS ON MANAGEMENT INFORMATION SYSTEMS, 2025, 16 (01)
  • [26] Text-like Encoding of Collaborative Information in Large Language Models for Recommendation
    Zhang, Yang
    Bao, Keqin
    Yan, Ming
    Wang, Wenjie
    Feng, Fuli
    He, Xiangnan
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 9181 - 9191
  • [27] Structured learning for spatial information extraction from biomedical text: bacteria biotopes
    Kordjamshidi, Parisa
    Roth, Dan
    Moens, Marie-Francine
    BMC BIOINFORMATICS, 2015, 16
  • [30] Intelligent extraction of reservoir dispatching information integrating large language model and structured prompts
    Yang, Yangrui
    Chen, Sisi
    Zhu, Yaping
    Liu, Xuemei
    Ma, Wei
    Feng, Ling
    SCIENTIFIC REPORTS, 2024, 14 (01):