Data extraction from polymer literature using large language models

Cited by: 2
Authors
Gupta, Sonakshi [1]
Mahmood, Akhlak [2]
Shetty, Pranav [1]
Adeboye, Aishat [3]
Ramprasad, Rampi [2]
Affiliations
[1] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA USA
[2] Georgia Inst Technol, Sch Mat Sci & Engn, Atlanta, GA 30332 USA
[3] Georgia Inst Technol, Sch Chem & Biomol Engn, Atlanta, GA USA
Keywords
Natural language processing systems
DOI
10.1038/s43246-024-00708-9
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advancing materials discovery. However, applying this process to large spans of text remains a challenge because of the specialized nature and varied styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of approximately 2.4 million full-text articles, our method identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website.
Pages: 11