A Multimodal Fusion Drug Molecular Attribute Prediction Method Based on Bert and GCN

Authors
Yan, Xiao-Ying [1]
Jin, Yan-Chun [1]
Feng, Yue-Hua [1]
Zhang, Shao-Wu [2]
Affiliations
[1] Xian Shiyou Univ, Coll Comp Sci, Xian 710065, Peoples R China
[2] Northwestern Polytech Univ, Sch Automat, Key Lab Informat Fus Technol, Minist Educ, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Bert pretraining; attention mechanism; molecular fingerprint; molecular attribute prediction; graph convolutional neural network;
DOI
10.16476/j.pibb.2024.0299
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Objective: Molecular property prediction plays a crucial role in drug development, especially in virtual screening and compound optimization. Advances in artificial intelligence (AI) have produced numerous deep learning-based methods that show significant potential for improving molecular property prediction. However, acquiring labeled molecular data is costly and time-consuming, and the scarcity of labeled data makes it difficult for supervised machine learning models to generalize across the vast chemical space. To overcome these limitations, we propose a novel Bert and GCN-based multimodal fusion method (called BGMF) to predict molecular properties.
Methods: BGMF extracts comprehensive molecular representations from atomic sequences, molecular fingerprint sequences, and molecular graph data, and combines them through pre-training and fine-tuning. The method consists of three main parts: (1) molecular feature extraction; (2) Bert-GCN based pre-training; (3) fine-tuning. During molecular feature extraction, the Morgan algorithm is used to generate molecular fingerprints, transforming the input SMILES strings of drugs into molecular fingerprint sentences. Simultaneously, atom sentences are created based on the atom indices within the molecule. Consequently, drug molecules are represented as both molecular fingerprint sentences and atom sentences. In the pre-training stage, BGMF applies a self-supervised learning strategy, specifically masked molecular fingerprint recovery and masked atom recovery, to a large unlabeled dataset using the Bert model. Molecular graph data is incorporated by merging graph convolutional neural networks (GCN) with the Bert model, combining the global "word" features of drug molecules with the local topological features of molecular graphs. A dual decoder for atoms and molecular fingerprints is also developed to strengthen molecular feature expression. Finally, in the fine-tuning stage, a pooling layer and task-specific fully connected neural networks are added so that the pre-trained module can be applied to a variety of downstream molecular property prediction tasks.
Results: To validate the effectiveness of BGMF, we conducted experiments on 43 molecular attribute prediction tasks across 5 datasets. Compared with other recent state-of-the-art methods, BGMF achieves the best results in terms of area under the ROC curve (AUC). We also evaluated the generalization performance of BGMF on an independently constructed test dataset, where it achieved the best generalization performance. Additionally, ablation studies demonstrate the effect of the atomic sequence, the molecular fingerprint sequence, the GCN-based molecular graph module, and the pre-training module on the overall performance of the model.
Conclusion: In this paper, we propose BGMF, a novel method for drug molecular attribute prediction that integrates molecular graph data into the tasks of masked molecular fingerprint recovery and masked atom recovery by combining a graph convolutional neural network with the Bert model. The molecular fingerprint representations generated by BGMF were visualized using t-SNE, revealing that BGMF effectively captures the intrinsic structure and features of molecular fingerprints.
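The two ingredients named in the abstract, masked token recovery and graph convolution, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`mask_tokens`, `gcn_layer`, `matmul`), the token strings, the 15% masking rate, and the symmetric-normalized propagation rule are all assumptions made for illustration; the abstract does not specify any of them. The masking is also simplified: every selected token is replaced by `[MASK]`.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask each token independently with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] is the original token at
    masked positions (the recovery target) and None elsewhere.
    The 15% default is an assumed rate, not a value from the abstract.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    This symmetric normalization is an assumed, commonly used form.
    """
    n = len(adj)
    # Add self-loops so each atom keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical "fingerprint sentence": identifiers standing in for substructures.
sentence = ["fp_101", "fp_202", "fp_303", "fp_404"]
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.5, seed=1)

# Three heavy atoms in a chain (C-C-O), one-hot features: [is_C, is_O].
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
atom_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity_w = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adjacency, atom_feats, identity_w)
```

In this sketch the masked sentence would be the model input and `targets` the recovery labels, while `hidden` holds per-atom features after mixing each atom with its bonded neighbors. How BGMF actually fuses the two representations is not described in enough detail in the abstract to reproduce here.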
Pages: 783-794
Page count: 12