Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding

Cited by: 2
Authors
Hadfield, Thomas E. [1]
Scantlebury, Jack [1]
Deane, Charlotte M. [1]
Affiliations
[1] Univ Oxford, Dept Stat, Oxford Prot Informat Grp, Oxford, England
Keywords
Structure-based virtual screening; Machine learning; Interpretability; PROTEIN; DOCKING;
DOI
10.1186/s13321-023-00755-3
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high-affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable intermolecular interactions in the input protein-ligand complex. In this work we propose a novel approach for assessing the extent to which machine learning-based virtual screening models are able to identify the functional groups responsible for binding. To sidestep the difficulty in establishing the ground-truth importance of each atom of a large-scale set of protein-ligand complexes, we propose a protocol for generating synthetic data. Each ligand in the dataset is surrounded by a randomly sampled point cloud of pharmacophores, and the label assigned to the synthetic protein-ligand complex is determined by a 3-dimensional deterministic binding rule. This allows us to precisely quantify the ground-truth importance of each atom and compare it to the model-generated attributions. Using our generated datasets, we demonstrate that a recently proposed deep learning-based virtual screening model, PointVS, identified the most important functional groups 39% more efficiently than a fingerprint-based random forest, suggesting that it would generalise more effectively to new examples. In addition, we found that ligand-specific biases, such as those present in widely used virtual screening datasets, substantially impaired the ability of all ML models to identify the most important functional groups. We have made our synthetic data generation framework available to facilitate the benchmarking of new virtual screening models. Code is available at https://github.com/tomhadfield95/synthVS.
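The protocol described in the abstract (a random pharmacophore point cloud, a deterministic 3D binding rule, and a comparison of ground-truth atom importance against model attributions) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the synthVS code: the donor/acceptor complementarity rule, the 4.0 Å cutoff, the cubic sampling box, and the function names `sample_point_cloud`, `important_atoms`, `label` and `top_k_recovery` are all hypothetical. The actual binding rules and metrics used by the authors are not stated in the abstract.

```python
import math
import random

# Hypothetical complementary pharmacophore pairs (assumption, not from the paper).
COMPLEMENT = {"donor": "acceptor", "acceptor": "donor"}
CUTOFF = 4.0  # assumed distance threshold in angstroms


def sample_point_cloud(n_points, box=10.0, seed=0):
    """Randomly place typed pharmacophore points in a cube centred on the origin."""
    rng = random.Random(seed)
    return [
        (rng.choice(["donor", "acceptor"]),
         tuple(rng.uniform(-box, box) for _ in range(3)))
        for _ in range(n_points)
    ]


def important_atoms(ligand, cloud, cutoff=CUTOFF):
    """Indices of ligand atoms lying within `cutoff` of a complementary point."""
    hits = set()
    for i, (atom_type, atom_xyz) in enumerate(ligand):
        wanted = COMPLEMENT.get(atom_type)
        if wanted is None:
            continue  # atom types outside the toy rule can never be important
        for point_type, point_xyz in cloud:
            if point_type == wanted and math.dist(atom_xyz, point_xyz) <= cutoff:
                hits.add(i)
                break
    return hits


def label(ligand, cloud):
    """Deterministic label: 1 if at least one atom satisfies the rule, else 0."""
    return int(bool(important_atoms(ligand, cloud)))


def top_k_recovery(attributions, truth):
    """Fraction of ground-truth atoms found among the top-len(truth) attributions."""
    if not truth:
        return None  # undefined when no atom is important
    k = len(truth)
    ranked = sorted(range(len(attributions)),
                    key=lambda i: attributions[i], reverse=True)
    return len(set(ranked[:k]) & truth) / k


# Toy ligand: (pharmacophore type, xyz coordinates) per atom.
ligand = [
    ("donor", (0.0, 0.0, 0.0)),
    ("hydrophobic", (1.5, 0.0, 0.0)),
    ("acceptor", (3.0, 0.0, 0.0)),
]
# Hand-written cloud so the outcome is easy to check by eye.
cloud = [("acceptor", (0.0, 3.0, 0.0)), ("acceptor", (3.0, 3.0, 0.0))]

truth = important_atoms(ligand, cloud)            # only the donor atom qualifies
good = top_k_recovery([0.9, 0.1, 0.3], truth)     # attribution ranks atom 0 first
bad = top_k_recovery([0.1, 0.9, 0.3], truth)      # attribution ranks atom 1 first
print(truth, label(ligand, cloud), good, bad)
```

Because the label is a deterministic function of geometry, the set returned by `important_atoms` is the exact ground truth for that complex, which is what makes a per-atom comparison with model attributions possible. In practice the cloud would come from something like `sample_point_cloud` rather than being written by hand.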
Pages: 15
Journal
Journal of Cheminformatics, 15