Data extraction from polymer literature using large language models

Cited by: 2
Authors
Gupta, Sonakshi [1]
Mahmood, Akhlak [2]
Shetty, Pranav [1]
Adeboye, Aishat [3]
Ramprasad, Rampi [2]
Affiliations
[1] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA USA
[2] Georgia Inst Technol, Sch Mat Sci & Engn, Atlanta, GA 30332 USA
[3] Georgia Inst Technol, Sch Chem & Biomol Engn, Atlanta, GA USA
Keywords
Natural language processing systems;
DOI
10.1038/s43246-024-00708-9
Chinese Library Classification number
T [Industrial Technology];
Discipline classification code
08;
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advancing materials discovery. However, extraction from large spans of text remains a challenge because of the specialized nature and varied styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of approximately 2.4 million full-text articles, our method identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website.
Pages: 11
Related papers
50 in total
  • [41] GeDa: Improving training data with large language models for Aspect Sentiment Triplet Extraction
    Mai, Weixing
    Zhang, Zhengxuan
    Chen, Yifan
    Li, Kuntao
    Xue, Yun
    KNOWLEDGE-BASED SYSTEMS, 2024, 301
  • [42] Decomposing Relational Triple Extraction with Large Language Models for Better Generalization on Unseen Data
    Meng, Boyu
    Lin, Tianhe
    Yang, Deqing
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT IV, PAKDD 2024, 2024, 14648 : 104 - 115
  • [43] Literature Hunter: Literature Reading Aided by Large Language Models
    Lai, Yahao
    Chen, Xiang
    Du, Yunchen
    Liu, Bo
    Zhao, Shaofeng
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT V, NLPCC 2024, 2025, 15363 : 331 - 341
  • [44] Extracting chemical food safety hazards from the scientific literature automatically using large language models
    Ozen, Neris
    Mu, Wenjuan
    Asselt, Esther D. van
    van den Bulk, Leonieke M.
    APPLIED FOOD RESEARCH, 2025, 5 (01):
  • [45] Generating Data for Symbolic Language with Large Language Models
    Ye, Jiacheng
    Li, Chengzu
    Kong, Lingpeng
    Yu, Tao
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 8418 - 8443
  • [46] Using Large Language Models to Retrieve Critical Data from Clinical Processes and Business Rules
    Yu, Yunguo
    Gomez-Cabello, Cesar A.
    Makarova, Svetlana
    Parte, Yogesh
    Borna, Sahar
    Haider, Syed Ali
    Genovese, Ariana
    Prabha, Srinivasagam
    Forte, Antonio J.
    BIOENGINEERING-BASEL, 2025, 12 (01):
  • [47] Characterizing Spin in Psychiatric Clinical Research Literature Using Large Language Models
    Perlis, Roy H.
    JAMA NETWORK OPEN, 2025, 8 (02)
  • [48] Understanding Sarcoidosis Using Large Language Models and Social Media Data
    Xi, Nan Miles
    Ji, Hong-Long
    Wang, Lin
    JOURNAL OF HEALTHCARE INFORMATICS RESEARCH, 2024,
  • [49] Large language models for generative information extraction: a survey
    Xu, Derong
    Chen, Wei
    Peng, Wenjun
    Zhang, Chao
    Xu, Tong
    Zhao, Xiangyu
    Wu, Xian
    Zheng, Yefeng
    Wang, Yang
    Chen, Enhong
    FRONTIERS OF COMPUTER SCIENCE, 2024, 18 (06)
  • [50] Revisiting Relation Extraction in the era of Large Language Models
    Wadhwa, Somin
    Amir, Silvio
    Wallace, Byron C.
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15566 - 15589