VAE-Sim: A Novel Molecular Similarity Measure Based on a Variational Autoencoder

Cited by: 27
Authors
Samanta, Soumitra [1 ]
O'Hagan, Steve [2 ,4 ]
Swainston, Neil [1 ]
Roberts, Timothy J. [1 ]
Kell, Douglas B. [1 ,3 ]
Affiliations
[1] Univ Liverpool, Inst Syst Mol & Integrat Biol, Dept Biochem & Syst Biol, Crown St, Liverpool L69 7ZB, Merseyside, England
[2] Univ Manchester, Manchester Inst Biotechnol, Dept Chem, 131 Princess St, Manchester M1 7DN, Lancs, England
[3] Tech Univ Denmark, Novo Nordisk Fdn Ctr Biosustainabil, Bldg 220, DK-2800 Lyngby, Denmark
[4] Univ Coll London Hosp NHS Fdn Trust, 250 Euston Rd, London NW1 2PB, England
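Source
MOLECULES | 2020 / Volume 25 / Issue 15
Funding
Biotechnology and Biological Sciences Research Council (BBSRC), UK; Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords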
cheminformatics; molecular similarity; deep learning; variational autoencoder; SMILES; PYROLYSIS MASS-SPECTROMETRY; DRUG DISCOVERY; MARKETED DRUGS; DESIGN; DESCRIPTORS; FINGERPRINTS; NETWORKS; REPRESENTATION; PROMISCUITY; FOUNDATIONS;
DOI
10.3390/molecules25153446
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Molecular similarity is an elusive but core "unsupervised" cheminformatics concept, yet different "fingerprint" encodings of molecular structures return very different similarity values, even when using the same similarity metric. Each encoding may be of value when applied to other problems with objective or target functions, implying that a priori none are "better" than the others, nor than encoding-free metrics such as maximum common substructure (MCSS). We here introduce a novel approach to molecular similarity, in the form of a variational autoencoder (VAE). This learns the joint distribution p(z|x) where z is a latent vector and x are the (same) input/output data. It takes the form of a "bowtie"-shaped artificial neural network. In the middle is a "bottleneck layer" or latent vector in which inputs are transformed into, and represented as, a vector of numbers (encoding), with a reverse process (decoding) seeking to return the SMILES string that was the input. We train a VAE on over six million druglike molecules and natural products (including over one million in the final holdout set). The VAE vector distances provide a rapid and novel metric for molecular similarity that is both easily and rapidly calculated. We describe the method and its application to a typical similarity problem in cheminformatics.
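As a rough illustration of the idea in the abstract, the sketch below compares molecules by the distance between their latent vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which distance is used, so Euclidean distance is an assumption here, the function name `rank_by_distance` is hypothetical, and the vectors are made-up placeholders standing in for the output of a trained VAE encoder (which is not included).

```python
import numpy as np


def rank_by_distance(query, library):
    """Rank library latent vectors from nearest to farthest from the query.

    Euclidean distance is an assumed choice; the abstract only says that
    "VAE vector distances" serve as the similarity metric.
    """
    query = np.asarray(query, dtype=float)
    library = np.asarray(library, dtype=float)
    # One distance per library row (per molecule).
    dists = np.linalg.norm(library - query, axis=1)
    order = np.argsort(dists)
    return order, dists


# Placeholder latent vectors (NOT real encoder output). In practice each
# row would come from encoding one SMILES string with a trained VAE.
z_query = [0.0, 0.0, 0.0, 0.0]
z_library = [
    [3.0, 4.0, 0.0, 0.0],  # hypothetical molecule A
    [1.0, 0.0, 0.0, 0.0],  # hypothetical molecule B
    [0.0, 0.0, 2.0, 0.0],  # hypothetical molecule C
]

order, dists = rank_by_distance(z_query, z_library)
print(order.tolist())  # indices of library molecules, most similar first
print(dists.tolist())  # raw distances; smaller means more similar
```

Because the comparison reduces to simple vector arithmetic once the encodings exist, ranking a large library against a query is cheap, which is consistent with the abstract's claim that the metric is rapidly calculated.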
Pages: 16
Related papers
50 in total
  • [31] VAE-TALSTM: a temporal attention and variational autoencoder-based long short-term memory framework for dam displacement prediction
    Xiaosong Shu
    Tengfei Bao
    Yangtao Li
    Jian Gong
    Kang Zhang
    Engineering with Computers, 2022, 38 : 3497 - 3512
  • [32] Extensive framework based on novel convolutional and variational autoencoder based on maximization of mutual information for anomaly detection
    Qien Yu
    Muthu Subash Kavitha
    Takio Kurita
    Neural Computing and Applications, 2021, 33 : 13785 - 13807
  • [33] Extensive framework based on novel convolutional and variational autoencoder based on maximization of mutual information for anomaly detection
    Yu, Qien
    Kavitha, Muthusubash
    Kurita, Takio
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (20): : 13785 - 13807
  • [34] Feature trees: A new molecular similarity measure based on tree matching
    Matthias Rarey
    J. Scott Dixon
    Journal of Computer-Aided Molecular Design, 1998, 12 : 471 - 490
  • [35] Feature trees: A new molecular similarity measure based on tree matching
    Rarey, M
    Dixon, JS
    JOURNAL OF COMPUTER-AIDED MOLECULAR DESIGN, 1998, 12 (05) : 471 - 490
  • [36] Generative model based on junction tree variational autoencoder for HOMO value prediction and molecular optimization
    Kondratyev, Vladimir
    Dryzhakov, Marian
    Gimadiev, Timur
    Slutskiy, Dmitriy
    JOURNAL OF CHEMINFORMATICS, 2023, 15 (01)
  • [37] Generative model based on junction tree variational autoencoder for HOMO value prediction and molecular optimization
    Vladimir Kondratyev
    Marian Dryzhakov
    Timur Gimadiev
    Dmitriy Slutskiy
    Journal of Cheminformatics, 15
  • [38] XIDINTFL-VAE: XGBoost-based intrusion detection of imbalance network traffic via class-wise focal loss variational autoencoder
    Abdulganiyu, Oluwadamilare Harazeem
    Tchakoucht, Taha Ait
    Saheed, Yakub Kayode
    Ahmed, Hilali Alaoui
    JOURNAL OF SUPERCOMPUTING, 2025, 81 (01):
  • [39] A Novel Data Dependent Similarity Measure Algorithm Based on Attribute Selection
    Gao, Zhipeng
    Deng, Nanjie
    Niu, Kun
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2018, : 603 - 606
  • [40] A novel travel-time based similarity measure for hierarchical clustering
    Lu, Yonggang
    Hou, Xiaoli
    Chen, Xurong
    NEUROCOMPUTING, 2016, 173 : 3 - 8