Autonomous data extraction from peer reviewed literature for training machine learning models of oxidation potentials

Cited by: 3

Authors:

Lee, Siwoo ^{[1]}

Heinen, Stefan ^{[2]}

Khan, Danish ^{[2]}

von Lilienfeld, O. Anatole ^{[1,2,3,4,5,6,7]}

Affiliations:

[1] Univ Toronto, Dept Chem, St George Campus, Toronto, ON, Canada

[2] Vector Inst Artificial Intelligence, Toronto, ON M5S 1M1, Canada

[3] Univ Toronto, Accelerat Consortium, 80 St George St, Toronto, ON M5S 3H6, Canada

[4] Univ Toronto, Dept Mat Sci & Engn, St George Campus, Toronto, ON, Canada

[5] Univ Toronto, Dept Phys, St George Campus, Toronto, ON, Canada

[6] Tech Univ Berlin, Machine Learning Grp, Berlin, Germany

[7] Berlin Inst Fdn Learning & Data, Berlin, Germany

Source:

MACHINE LEARNING-SCIENCE AND TECHNOLOGY | 2024 / Vol. 5 / Issue 01

Funding:

European Research Council;

Keywords:

quantum chemistry; machine learning; oxidation potentials; large language model; literature data extraction; RATIONAL DESIGN; ENERGY; DERIVATIVES; CHEMISTRY; LANGUAGE; IMPLICIT;

DOI:

10.1088/2632-2153/ad2f52

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an automated data-collection pipeline involving a convolutional neural network and a large language model to extract user-specified tabular data from peer-reviewed literature. The pipeline is applied to 74 reports published between 1957 and 2014 with experimentally-measured oxidation potentials for 592 organic molecules (-0.75 to 3.58 V). After data curation (solvents, reference electrodes, and missed data points), we trained multiple supervised machine learning (ML) models reaching prediction errors similar to experimental uncertainty (~0.2 V). For experimental measurements of identical molecules reported in multiple studies, we identified the most likely value based on out-of-sample ML predictions. Using the trained ML models, we then estimated oxidation potentials of ~132k small organic molecules from the QM9 (quantum mechanics data for organic molecules with up to 9 atoms not counting hydrogens) data set, with predicted values spanning 0.21-3.46 V. Analysis of the QM9 predictions in terms of plausible descriptor-property trends suggests that aliphaticity increases the oxidation potential of an organic molecule on average from ~1.5 V to ~2 V, while an increase in number of heavy atoms lowers it systematically. The pipeline introduced offers significant reductions in human labor otherwise required for conventional manual data collection of experimental results, and exemplifies how to accelerate scientific research through automation.

Pages: 12
