Structured information extraction from scientific text with large language models

Cited by: 91
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
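Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code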
07 ; 0710 ; 09 ;
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
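The abstract notes that extracted records can be returned as a list of JSON objects. As a minimal sketch of what consuming such output might look like for the dopant and host linking task, the Python below parses a model completion into typed records. The `host`/`dopants` schema, the `DopingRecord` class, the `parse_completion` helper, and the example sentence and completion are all hypothetical illustrations, not the schema or code used by the authors, and no model is actually called.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DopingRecord:
    """One host material and the dopants linked to it (hypothetical schema)."""
    host: str
    dopants: list[str] = field(default_factory=list)


def parse_completion(completion: str) -> list[DopingRecord]:
    """Parse a model completion expected to hold a JSON list of objects.

    Malformed output yields an empty list instead of raising, because a
    generative model is not guaranteed to emit valid JSON.
    """
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, list):
        return []

    records = []
    for item in payload:
        # Skip entries that do not match the assumed schema.
        if not isinstance(item, dict) or not isinstance(item.get("host"), str):
            continue
        raw_dopants = item.get("dopants")
        if not isinstance(raw_dopants, list):
            raw_dopants = []
        dopants = [d for d in raw_dopants if isinstance(d, str)]
        records.append(DopingRecord(host=item["host"], dopants=dopants))
    return records


# Hypothetical completion for the made-up sentence
# "ZnO films were doped with Al and Ga."
example_completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}]'
records = parse_completion(example_completion)
print(records)
```

Validating each object before building a record keeps one malformed entry from discarding the rest of a paragraph's extractions, which matters when completions are collected at database scale.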
Pages: 14