Structured information extraction from scientific text with large language models

Cited by: 91
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
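Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09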
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task requiring specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
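The abstract notes that extracted records can be returned as a list of JSON objects. As a rough illustration only, the Python sketch below parses such an output for the dopant/host linking task. The completion string and the field names "host" and "dopants" are hypothetical and are not taken from the paper; no model is called.

```python
import json

# Hypothetical completion text, standing in for what a fine-tuned model
# might return; the "host"/"dopants" schema is an assumption.
completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]}, '
    '{"host": "TiO2", "dopants": ["Nb"]}]'
)


def parse_records(text):
    """Parse a completion into a list of dict records; return [] on bad output."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only JSON objects; ignore stray strings or numbers.
    return [rec for rec in data if isinstance(rec, dict)]


def host_dopant_pairs(records):
    """Flatten records into (host, dopant) tuples, one per linked pair."""
    pairs = []
    for rec in records:
        host = rec.get("host")
        for dopant in rec.get("dopants", []):
            pairs.append((host, dopant))
    return pairs


records = parse_records(completion)
pairs = host_dopant_pairs(records)
```

Returning an empty list on malformed output is one possible design choice; it lets a downstream pipeline skip a bad completion rather than fail on it.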
Pages: 14