Autonomous data extraction from peer reviewed literature for training machine learning models of oxidation potentials

Cited by: 3
Authors
Lee, Siwoo [1]
Heinen, Stefan [2]
Khan, Danish [2]
von Lilienfeld, O. Anatole [1, 2, 3, 4, 5, 6, 7]
Affiliations
[1] Univ Toronto, Dept Chem, St George Campus, Toronto, ON, Canada
[2] Vector Inst Artificial Intelligence, Toronto, ON M5S 1M1, Canada
[3] Univ Toronto, Accelerat Consortium, 80 St George St, Toronto, ON M5S 3H6, Canada
[4] Univ Toronto, Dept Mat Sci & Engn, St George Campus, Toronto, ON, Canada
[5] Univ Toronto, Dept Phys, St George Campus, Toronto, ON, Canada
[6] Tech Univ Berlin, Machine Learning Grp, Berlin, Germany
[7] Berlin Inst Fdn Learning & Data, Berlin, Germany
Source
Funding
European Research Council;
Keywords
quantum chemistry; machine learning; oxidation potentials; large language model; literature data extraction; RATIONAL DESIGN; ENERGY; DERIVATIVES; CHEMISTRY; LANGUAGE; IMPLICIT;
DOI
10.1088/2632-2153/ad2f52
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We present an automated data-collection pipeline involving a convolutional neural network and a large language model to extract user-specified tabular data from peer-reviewed literature. The pipeline is applied to 74 reports published between 1957 and 2014 with experimentally measured oxidation potentials for 592 organic molecules (-0.75 to 3.58 V). After data curation (solvents, reference electrodes, and missed data points), we trained multiple supervised machine learning (ML) models reaching prediction errors similar to experimental uncertainty (~0.2 V). For experimental measurements of identical molecules reported in multiple studies, we identified the most likely value based on out-of-sample ML predictions. Using the trained ML models, we then estimated oxidation potentials of ~132k small organic molecules from the QM9 (quantum mechanics data for organic molecules with up to 9 atoms, not counting hydrogens) data set, with predicted values spanning 0.21-3.46 V. Analysis of the QM9 predictions in terms of plausible descriptor-property trends suggests that aliphaticity increases the oxidation potential of an organic molecule on average from ~1.5 V to ~2 V, while an increase in the number of heavy atoms lowers it systematically. The pipeline introduced offers significant reductions in the human labor otherwise required for conventional manual data collection of experimental results, and exemplifies how to accelerate scientific research through automation.
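As an illustration of the duplicate-resolution step mentioned in the abstract, the following minimal Python sketch keeps, for each molecule reported in several studies, the measured value closest to an out-of-sample ML prediction. The abstract does not state the selection criterion, so the nearest-to-prediction rule is an assumption, and the function name `pick_most_likely`, the molecule labels, and all numeric values are hypothetical.

```python
from typing import Dict, List


def pick_most_likely(
    reported: Dict[str, List[float]],
    predicted: Dict[str, float],
) -> Dict[str, float]:
    """Keep, per molecule, the reported value closest to its prediction.

    Assumption: "most likely" is read as "smallest absolute deviation
    from the out-of-sample ML prediction"; the paper may use another rule.
    """
    chosen: Dict[str, float] = {}
    for molecule, values in reported.items():
        # Skip molecules without a prediction or without any measurement.
        if molecule not in predicted or not values:
            continue
        target = predicted[molecule]
        chosen[molecule] = min(values, key=lambda v: abs(v - target))
    return chosen


# Hypothetical oxidation potentials in volts (not taken from the paper).
reported = {"mol_A": [1.10, 1.45, 1.52], "mol_B": [0.80]}
predicted = {"mol_A": 1.48, "mol_B": 0.95}

selected = pick_most_likely(reported, predicted)
print(selected)
```

For the selection to be meaningful, each prediction should come from a model that did not see that molecule during training, for example through cross-validation.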
Pages: 12