Autonomous data extraction from peer reviewed literature for training machine learning models of oxidation potentials

Cited by: 3
Authors
Lee, Siwoo [1]
Heinen, Stefan [2]
Khan, Danish [2]
von Lilienfeld, O. Anatole [1, 2, 3, 4, 5, 6, 7]
Affiliations
[1] Univ Toronto, Dept Chem, St George Campus, Toronto, ON, Canada
[2] Vector Inst Artificial Intelligence, Toronto, ON M5S 1M1, Canada
[3] Univ Toronto, Accelerat Consortium, 80 St George St, Toronto, ON M5S 3H6, Canada
[4] Univ Toronto, Dept Mat Sci & Engn, St George Campus, Toronto, ON, Canada
[5] Univ Toronto, Dept Phys, St George Campus, Toronto, ON, Canada
[6] Tech Univ Berlin, Machine Learning Grp, Berlin, Germany
[7] Berlin Inst Fdn Learning & Data, Berlin, Germany
Funding
European Research Council;
Keywords
quantum chemistry; machine learning; oxidation potentials; large language model; literature data extraction; RATIONAL DESIGN; ENERGY; DERIVATIVES; CHEMISTRY; LANGUAGE; IMPLICIT;
DOI
10.1088/2632-2153/ad2f52
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We present an automated data-collection pipeline involving a convolutional neural network and a large language model to extract user-specified tabular data from peer-reviewed literature. The pipeline is applied to 74 reports published between 1957 and 2014 with experimentally measured oxidation potentials for 592 organic molecules (-0.75 to 3.58 V). After data curation (solvents, reference electrodes, and missed data points), we trained multiple supervised machine learning (ML) models reaching prediction errors similar to experimental uncertainty (~0.2 V). For experimental measurements of identical molecules reported in multiple studies, we identified the most likely value based on out-of-sample ML predictions. Using the trained ML models, we then estimated oxidation potentials of ~132k small organic molecules from the QM9 (quantum mechanics data for organic molecules with up to 9 atoms not counting hydrogens) data set, with predicted values spanning 0.21-3.46 V. Analysis of the QM9 predictions in terms of plausible descriptor-property trends suggests that aliphaticity increases the oxidation potential of an organic molecule on average from ~1.5 V to ~2 V, while an increase in the number of heavy atoms lowers it systematically. The pipeline introduced offers significant reductions in human labor otherwise required for conventional manual data collection of experimental results, and exemplifies how to accelerate scientific research through automation.
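The abstract does not say how the "most likely value" is chosen when several studies report different oxidation potentials for the same molecule. One plausible reading, shown here purely as an illustrative sketch and not as the authors' procedure, is to keep the reported value that lies closest to a prediction from a model that never saw that molecule during training. The function name `pick_most_likely`, the molecule label, and all numbers below are hypothetical.

```python
def pick_most_likely(reported, predicted):
    """Return the reported value closest to an out-of-sample prediction.

    reported  -- oxidation potentials (V) for one molecule from different studies
    predicted -- prediction (V) from a model that never saw this molecule
    """
    if not reported:
        raise ValueError("need at least one reported value")
    # Assumption: "most likely" means smallest absolute deviation from the prediction.
    return min(reported, key=lambda value: abs(value - predicted))


# Hypothetical numbers, for illustration only (not taken from the paper).
conflicting_reports = {"molecule_A": [1.20, 1.65, 1.31]}
out_of_sample_prediction = {"molecule_A": 1.28}

selected = {
    name: pick_most_likely(values, out_of_sample_prediction[name])
    for name, values in conflicting_reports.items()
}
print(selected)
```

Under this assumption, a discrepancy between studies is resolved per molecule, so the choice depends on how accurate the held-out prediction is; with errors near the stated experimental uncertainty (~0.2 V), values that differ by less than that cannot be distinguished reliably.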
Pages: 12