Autonomous data extraction from peer reviewed literature for training machine learning models of oxidation potentials

Cited by: 3
Authors
Lee, Siwoo [1]
Heinen, Stefan [2]
Khan, Danish [2]
von Lilienfeld, O. Anatole [1, 2, 3, 4, 5, 6, 7]
Affiliations
[1] Univ Toronto, Dept Chem, St George Campus, Toronto, ON, Canada
[2] Vector Inst Artificial Intelligence, Toronto, ON M5S 1M1, Canada
[3] Univ Toronto, Accelerat Consortium, 80 St George St, Toronto, ON M5S 3H6, Canada
[4] Univ Toronto, Dept Mat Sci & Engn, St George Campus, Toronto, ON, Canada
[5] Univ Toronto, Dept Phys, St George Campus, Toronto, ON, Canada
[6] Tech Univ Berlin, Machine Learning Grp, Berlin, Germany
[7] Berlin Inst Fdn Learning & Data, Berlin, Germany
Funding
European Research Council
Keywords
quantum chemistry; machine learning; oxidation potentials; large language model; literature data extraction; RATIONAL DESIGN; ENERGY; DERIVATIVES; CHEMISTRY; LANGUAGE; IMPLICIT;
DOI
10.1088/2632-2153/ad2f52
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present an automated data-collection pipeline involving a convolutional neural network and a large language model to extract user-specified tabular data from peer-reviewed literature. The pipeline is applied to 74 reports published between 1957 and 2014 with experimentally measured oxidation potentials for 592 organic molecules (-0.75 to 3.58 V). After data curation (solvents, reference electrodes, and missed data points), we trained multiple supervised machine learning (ML) models reaching prediction errors similar to experimental uncertainty (~0.2 V). For experimental measurements of identical molecules reported in multiple studies, we identified the most likely value based on out-of-sample ML predictions. Using the trained ML models, we then estimated oxidation potentials of ~132k small organic molecules from the QM9 (quantum mechanics data for organic molecules with up to 9 atoms not counting hydrogens) data set, with predicted values spanning 0.21-3.46 V. Analysis of the QM9 predictions in terms of plausible descriptor-property trends suggests that aliphaticity increases the oxidation potential of an organic molecule on average from ~1.5 V to ~2 V, while an increase in number of heavy atoms lowers it systematically. The pipeline introduced offers significant reductions in human labor otherwise required for conventional manual data collection of experimental results, and exemplifies how to accelerate scientific research through automation.
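The abstract does not spell out how the "most likely value" is chosen when several studies report different potentials for the same molecule. As an illustrative sketch only, one plausible reading is to keep the reported measurement that lies closest to the prediction of a model that never saw that molecule during training. The function name `pick_most_likely` and all numbers below are hypothetical and are not taken from the paper.

```python
def pick_most_likely(reported, predicted):
    """Return the reported value closest to an out-of-sample prediction.

    reported  : list of conflicting experimental values (V) for one molecule
    predicted : prediction (V) from a model not trained on that molecule
    Assumption: absolute deviation is the selection criterion; the paper
    may use a different rule.
    """
    return min(reported, key=lambda value: abs(value - predicted))


# Hypothetical example values in volts, for illustration only.
reported_values = [1.10, 1.45, 1.52]
oos_prediction = 1.48

chosen = pick_most_likely(reported_values, oos_prediction)
print(chosen)
```

A rule like this relies on the model being trained without any of the conflicting measurements for the molecule in question; otherwise the prediction would be biased toward whichever value was in the training set.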
Pages: 12