Large-scale chemical language representations capture molecular structure and properties

Cited by: 128
Authors:
Ross, Jerret [1 ]
Belgodere, Brian [1 ]
Chenthamarakshan, Vijil [1 ]
Padhi, Inkit [1 ]
Mroueh, Youssef [1 ]
Das, Payel [1 ]
Affiliations:
[1] IBM Res, Yorktown Hts, NY 10598 USA
Keywords:
Computational linguistics; Graph neural networks; Molecules; Natural language processing systems; Quantum chemistry; Supervised learning
DOI
10.1038/s42256-022-00580-7
Chinese Library Classification:
TP18 [Theory of artificial intelligence]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Large language models have recently emerged with extraordinary capabilities, and these methods can be applied to model other kinds of sequences, such as string representations of molecules. Ross and colleagues have created a transformer-based model, trained on a large dataset of molecules, which provides good results on property prediction tasks. Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and materials design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets, and performs competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
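The abstract mentions that MoLFormer encodes token positions with rotary positional embeddings (RoPE). The following is an illustrative sketch of that position-encoding scheme only, not the authors' implementation: it rotates each pair of embedding dimensions by an angle proportional to the token's position, so that attention scores between rotated queries and keys depend on relative position. The token list, embedding size, and function name are arbitrary choices for the example.

```python
# Illustrative sketch (not MoLFormer's code): rotary positional
# embeddings applied to toy SMILES token embeddings.
import numpy as np

def rotary_embed(x: np.ndarray) -> np.ndarray:
    """Apply rotary positional embeddings to a (seq_len, dim) array."""
    seq_len, dim = x.shape
    assert dim % 2 == 0, "embedding dimension must be even"
    # One rotation frequency per dimension pair, as in the RoPE formulation.
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Toy embeddings for the first few SMILES tokens of a molecule.
tokens = ["C", "N", "1", "C", "="]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(tokens), 8))
rotated = rotary_embed(emb)
print(rotated.shape)  # (5, 8)
```

Because each pair of dimensions is rotated rather than shifted, the embedding norms are preserved, and the token at position 0 is left unchanged (its rotation angle is zero).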
Pages: 1256-1264
Number of pages: 13
Related papers (50 in total):
  • [1] Large-scale chemical language representations capture molecular structure and properties
    Jerret Ross
    Brian Belgodere
    Vijil Chenthamarakshan
    Inkit Padhi
    Youssef Mroueh
    Payel Das
    Nature Machine Intelligence, 2022, 4 : 1256 - 1264
  • [2] MULTISCALING PROPERTIES OF LARGE-SCALE STRUCTURE IN THE UNIVERSE
    MARTINEZ, VJ
    PAREDES, S
    BORGANI, S
    COLES, P
    SCIENCE, 1995, 269 (5228) : 1245 - 1247
  • [3] Biochar data into structure: A methodology for generating large-scale atomistic representations
    Sierra-Jimenez, Valentina
    Mathews, Jonathan P.
    Yoo, Pilsun
    Budai, Alice
    Chejne, Farid
    Dufour, Anthony
    Garcia-Perez, Manuel
    CARBON, 2024, 228
  • [4] LARGE-SCALE WAVE STRUCTURE IN ORION MOLECULAR CLOUD
    PHILLIPS, TG
    JEFFERTS, KB
    WANNIER, PG
    ADE, PAR
    ASTROPHYSICAL JOURNAL, 1974, 191 (01): : L31 - L32
  • [5] Chemical structure effects on coal pyrolyzates and reactions by using large-scale reactive molecular dynamics
    Zheng, Mo
    Li, Xiaoxia
    Bai, Jin
    Guo, Li
    FUEL, 2022, 327
  • [6] Molecular Structure-Based Large-Scale Prediction of Chemical Induced Gene Expression Changes
    Liu, Ruifeng
    AbdulHameed, Mohamed Diwan M.
    Wallqvist, Anders
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2017, 57 (09) : 2194 - 2202
  • [8] Language matters: representations of 'heart failure' in English discourse - a large-scale linguistic study
    Demmen, Jane
    Hartshorne-Evans, Nick
    Semino, Elena
    Sankaranarayanan, Rajiv
    OPEN HEART, 2022, 9 (01):
  • [9] A Large-Scale Database for Chemical Structure Recognition and Preliminary Evaluation
    Ding, Longfei
    Zhao, Mengbiao
    Yin, Fei
    Zeng, Shuiling
    Liu, Cheng-Lin
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1464 - 1470
  • [10] Readable representations for large-scale bipartite graphs
    Sato, Shuji
    Misue, Kazuo
    Tanaka, Jiro
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 2, PROCEEDINGS, 2008, 5178 : 831 - 838