VAE-Sim: A Novel Molecular Similarity Measure Based on a Variational Autoencoder

Cited by: 27
Authors
Samanta, Soumitra [1]
O'Hagan, Steve [2,4]
Swainston, Neil [1]
Roberts, Timothy J. [1]
Kell, Douglas B. [1,3]
Affiliations
[1] Univ Liverpool, Inst Syst Mol & Integrat Biol, Dept Biochem & Syst Biol, Crown St, Liverpool L69 7ZB, Merseyside, England
[2] Univ Manchester, Manchester Inst Biotechnol, Dept Chem, 131 Princess St, Manchester M1 7DN, Lancs, England
[3] Tech Univ Denmark, Novo Nordisk Fdn Ctr Biosustainabil, Bldg 220, DK-2800 Lyngby, Denmark
[4] Univ Coll London Hosp NHS Fdn Trust, 250 Euston Rd, London NW1 2PB, England
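Source
MOLECULES | 2020 / Vol. 25 / Issue 15
Funding
UK Biotechnology and Biological Sciences Research Council; UK Engineering and Physical Sciences Research Council
Keywords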
cheminformatics; molecular similarity; deep learning; variational autoencoder; SMILES; PYROLYSIS MASS-SPECTROMETRY; DRUG DISCOVERY; MARKETED DRUGS; DESIGN; DESCRIPTORS; FINGERPRINTS; NETWORKS; REPRESENTATION; PROMISCUITY; FOUNDATIONS;
DOI
10.3390/molecules25153446
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Molecular similarity is an elusive but core "unsupervised" cheminformatics concept, yet different "fingerprint" encodings of molecular structures return very different similarity values, even when using the same similarity metric. Each encoding may be of value when applied to other problems with objective or target functions, implying that a priori none are "better" than the others, nor than encoding-free metrics such as maximum common substructure (MCSS). We here introduce a novel approach to molecular similarity, in the form of a variational autoencoder (VAE). This learns the joint distribution p(z|x) where z is a latent vector and x are the (same) input/output data. It takes the form of a "bowtie"-shaped artificial neural network. In the middle is a "bottleneck layer" or latent vector in which inputs are transformed into, and represented as, a vector of numbers (encoding), with a reverse process (decoding) seeking to return the SMILES string that was the input. We train a VAE on over six million druglike molecules and natural products (including over one million in the final holdout set). The VAE vector distances provide a rapid and novel metric for molecular similarity that is both easily and rapidly calculated. We describe the method and its application to a typical similarity problem in cheminformatics.
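
The idea of using distances between latent vectors as a similarity measure can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distance is used, so the Euclidean distance and cosine similarity below are assumptions, and the molecule names and 4-dimensional latent vectors are hypothetical placeholders standing in for the output of a trained VAE encoder.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("latent vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero latent vectors."""
    if len(u) != len(v):
        raise ValueError("latent vectors must have the same dimension")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def rank_by_distance(query, library):
    """Return (name, distance) pairs sorted from nearest to farthest."""
    scored = ((name, euclidean_distance(query, vec)) for name, vec in library.items())
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical latent vectors; a real workflow would obtain these by
# passing SMILES strings through the trained encoder.
latents = {
    "mol_A": [0.0, 0.0, 0.0, 0.0],
    "mol_B": [3.0, 4.0, 0.0, 0.0],
    "mol_C": [1.0, 0.0, 0.0, 0.0],
}
query = [0.0, 0.0, 0.0, 0.0]
ranking = rank_by_distance(query, latents)
```

Once every molecule has been encoded, comparing a query against a library is a matter of vector arithmetic only, which is consistent with the abstract's description of the metric as rapidly calculated.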
Pages: 16