Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naive Bayes

Cited by: 139

Authors:

Lou, Wangchao [1]

Wang, Xiaoqing [1]

Chen, Fan [1]

Chen, Yixiao [1]

Jiang, Bo [1]

Zhang, Hua [1]

Affiliation:

[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou, Zhejiang, Peoples R China

Source:

PLOS ONE | 2014 / Vol. 9 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

RIBOSOMAL-RNA-BINDING; SECONDARY STRUCTURE; EVOLUTIONARY CONSERVATION; FOLD RECOGNITION; IDENTIFICATION; COVARIANCE; RESOLUTION; ACCURATE; RECEPTORS; DOMAINS;

DOI:

10.1371/journal.pone.0086703

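Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes: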

07 ; 0710 ; 09 ;

Abstract:

Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with a random forest and then applies wrapper-based feature selection with a forward best-first search strategy. The features draw on the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naive Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with a polynomial kernel, and support vector machine with a radial basis function kernel. DBPPred achieves the highest average accuracy (0.791) and average MCC (0.583) in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. In blind tests on the independent dataset PDB186, which compared the proposed model trained on the entire PDB594 dataset with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader), DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large dataset of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins. Taken together, the experimental results indicate that DBPPred can serve as a prospective alternative predictor for large-scale determination of DNA-binding proteins.
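The abstract reports accuracy and MCC (Matthews correlation coefficient) as its headline metrics. As a minimal sketch of how these two quantities are conventionally computed from binary predictions, the snippet below uses only the Python standard library. The function names and the toy labels are illustrative assumptions, not taken from the paper, and the record does not confirm the authors' exact evaluation code.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (TP, TN, FP, FN) for binary labels, where 1 = DNA-binding."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical, for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are imbalanced; returning 0.0 for a zero denominator is a common convention rather than part of the definition.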

Pages: 10
