Data extraction from polymer literature using large language models

Cited by: 2
Authors
Gupta, Sonakshi [1]
Mahmood, Akhlak [2]
Shetty, Pranav [1]
Adeboye, Aishat [3]
Ramprasad, Rampi [2]
Affiliations
[1] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA USA
[2] Georgia Inst Technol, Sch Mat Sci & Engn, Atlanta, GA 30332 USA
[3] Georgia Inst Technol, Sch Chem & Biomol Engn, Atlanta, GA USA
Keywords
Natural language processing systems
DOI
10.1038/s43246-024-00708-9
Chinese Library Classification
T [Industrial Technology]
Discipline classification code
08
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advancing materials discovery. However, extraction from long spans of text remains a challenge because of the specialized nature and varied styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (LlaMa 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of approximately 2.4 million full-text articles, our method identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website.
Pages: 11
Related papers
50 in total
  • [21] Event Extraction and Semantic Representation from Spanish Workers' Statute Using Large Language Models
    Terron, Gabriela Arguelles
    Chozas, Patricia Martin
    Doncel, Victor Rodriguez
    LEGAL KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 379 : 329 - 334
  • [22] A case study for automated attribute extraction from legal documents using large language models
    Adhikary, Subinay
    Sen, Procheta
    Roy, Dwaipayan
    Ghosh, Kripabandhu
    ARTIFICIAL INTELLIGENCE AND LAW, 2024
  • [23] Towards automated phenotype definition extraction using large language models
    Ramya Tekumalla
    Juan M. Banda
    Genomics & Informatics, 22 (1)
  • [24] Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature
    Vangala, Sarveswara Rao
    Krishnan, Sowmya Ramaswamy
    Bung, Navneet
    Nandagopal, Dhandapani
    Ramasamy, Gomathi
    Kumar, Satyam
    Sankaran, Sridharan
    Srinivasan, Rajgopal
    Roy, Arijit
    JOURNAL OF CHEMINFORMATICS, 2024, 16 (01)
  • [25] From promise to practice: challenges and pitfalls in the evaluation of large language models for data extraction in evidence synthesis
    Gartlehner, Gerald
    Kahwati, Leila
    Nussbaumer-Streit, Barbara
    Crotty, Karen
    Hilscher, Rainer
    Kugley, Shannon
    Viswanathan, Meera
    Thomas, Ian
    Konet, Amanda
    Booth, Graham
    Chew, Robert
    BMJ EVIDENCE-BASED MEDICINE, 2024
  • [26] Enhancing Relation Extraction Through Augmented Data: Large Language Models Unleashed
    Ali, Manzoor
    Nisar, Muhammad Sohail
    Saleem, Muhammad
    Moussallem, Diego
    Ngomo, Axel-Cyrille Ngonga
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, PT II, NLDB 2024, 2024, 14763 : 68 - 78
  • [27] Collaborative large language models for automated data extraction in living systematic reviews
    Khan, Muhammad Ali
    Ayub, Umair
    Naqvi, Syed Arsalan Ahmed
    Khakwani, Kaneez Zahra Rubab
    Sipra, Zaryab bin Riaz
    Raina, Ammad
    Zhou, Sihan
    He, Huan
    Saeidi, Amir
    Hasan, Bashar
    Rumble, Robert Bryan
    Bitterman, Danielle S.
    Warner, Jeremy L.
    Zou, Jia
    Tevaarwerk, Amye J.
    Leventakos, Konstantinos
    Kehl, Kenneth L.
    Palmer, Jeanne M.
    Murad, Mohammad Hassan
    Baral, Chitta
    bin Riaz, Irbaz
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2025
  • [28] Event extraction based on self-data augmentation with large language models
    Yang, Lishan
    Fan, Xi
    Wang, Xiangyu
    Wang, Xin
    Chen, Qiuju
    MEMETIC COMPUTING, 2025, 17 (01)
  • [29] Large Language Models as a Rapid and Objective Tool for Pathology Report Data Extraction
    Bolat, Beyza
    Eren, Ozgur Can
    Karasayar, A. Humeyra Dur
    Mericoz, Cisel Aydin
    Gunduz-Demir, Cigdem
    Kulac, Ibrahim
    TURKISH JOURNAL OF PATHOLOGY, 2024, 40 (02) : 138 - 141
  • [30] Harnessing large language models for data-scarce learning of polymer properties
    Liu, Ning
    Jafarzadeh, Siavash
    Lattimer, Brian Y.
    Ni, Shuna
    Lua, Jim
    Yu, Yue
    NATURE COMPUTATIONAL SCIENCE, 2025 : 245 - 254