Data extraction from polymer literature using large language models

被引:2
|
作者
Gupta, Sonakshi [1 ]
Mahmood, Akhlak [2 ]
Shetty, Pranav [1 ]
Adeboye, Aishat [3 ]
Ramprasad, Rampi [2 ]
机构
[1] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA USA
[2] Georgia Inst Technol, Sch Mat Sci & Engn, Atlanta, GA 30332 USA
[3] Georgia Inst Technol, Sch Chem & Biomol Engn, Atlanta, GA USA
关键词
Natural language processing systems;
D O I
10.1038/s43246-024-00708-9
中图分类号
T [工业技术];
学科分类号
08 ;
摘要
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process for large spans of text continues to be a challenge due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (LlaMa 2) large language models (LLM), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of similar to 2.4 million full text articles, our method successfully identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shots learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website.
引用
收藏
页数:11
相关论文
共 50 条
  • [31] Using Large Language Models to Enhance the Reusability of Sensor Data
    Berenguer, Alberto
    Morejon, Adriana
    Tomas, David
    Mazon, Jose-Norberto
    SENSORS, 2024, 24 (02)
  • [32] Structured information extraction from scientific text with large language models
    John Dagdelen
    Alexander Dunn
    Sanghoon Lee
    Nicholas Walker
    Andrew S. Rosen
    Gerbrand Ceder
    Kristin A. Persson
    Anubhav Jain
    Nature Communications, 15
  • [33] Structured information extraction from scientific text with large language models
    Dagdelen, John
    Dunn, Alexander
    Lee, Sanghoon
    Walker, Nicholas
    Rosen, Andrew S.
    Ceder, Gerbrand
    Persson, Kristin A.
    Jain, Anubhav
    NATURE COMMUNICATIONS, 2024, 15 (01)
  • [34] Toward Reliable Biodiversity Information Extraction From Large Language Models
    Elliott, Michael J.
    Fortes, Jose A. B.
    2024 IEEE 20TH INTERNATIONAL CONFERENCE ON E-SCIENCE, E-SCIENCE 2024, 2024,
  • [35] Enhancing Relation Extraction from Biomedical Texts by Large Language Models
    Asada, Masaki
    Fukuda, Ken
    ARTIFICIAL INTELLIGENCE IN HCI, PT III, AI-HCI 2024, 2024, 14736 : 3 - 14
  • [36] Data Extraction from Natural Language using Universal Networking Language
    Saha, Aloke Kumar
    Mridha, M. F.
    Rafiq, Jahir Ibna
    Das, Jugal Krishna
    2017 INTERNATIONAL CONFERENCE ON CURRENT TRENDS IN COMPUTER, ELECTRICAL, ELECTRONICS AND COMMUNICATION (CTCEEC), 2017, : 1228 - 1232
  • [37] Large language models as partners in medical literature
    Perez-Guerrero, Eduardo J.
    Mehrotra, Isha
    Jain, Sneha S.
    V. Perez, Marco
    HEART RHYTHM, 2025, 22 (02) : 579 - 584
  • [38] Scalable information extraction from free text electronic health records using large language models
    Gu, Bowen
    Shao, Vivian
    Liao, Ziqian
    Carducci, Valentina
    Brufau, Santiago Romero
    Yang, Jie
    Desai, Rishi J.
    BMC MEDICAL RESEARCH METHODOLOGY, 2025, 25 (01)
  • [39] Using Augmented Small Multimodal Models to Guide Large Language Models for Multimodal Relation Extraction
    He, Wentao
    Ma, Hanjie
    Li, Shaohua
    Dong, Hui
    Zhang, Haixiang
    Feng, Jie
    APPLIED SCIENCES-BASEL, 2023, 13 (22):
  • [40] Instruction Tuning Large Language Models for Multimodal Relation Extraction Using LoRA
    Li, Zou
    Pang, Ning
    Zhao, Xiang
    WEB INFORMATION SYSTEMS AND APPLICATIONS, WISA 2024, 2024, 14883 : 364 - 376