HPClas: A data-driven approach for identifying halophilic proteins based on catBoost

Cited by: 1
Authors
Hu, Shantong [1]
Wang, Xiaoyu [2, 3]
Wang, Zhikang [2, 3]
Jiang, Menghan [1]
Wang, Shihui [1]
Wang, Wenya [1]
Song, Jiangning [2, 3]
Zhang, Guimin [1]
Affiliations
[1] Beijing Univ Chem Technol, Coll Life Sci & Technol, Beijing, Peoples R China
[2] Monash Univ, Monash Biomed Discovery Inst, Melbourne, Vic, Australia
[3] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic, Australia
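Source
MLIFE | 2024, Vol. 3, Issue 04
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords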
feature engineering; halophilic protein; machine learning; AMINO-ACID-COMPOSITION; MANUALLY ANNOTATED SECTION; UNIPROTKB/SWISS-PROT; SECONDARY STRUCTURE; CRYSTAL-STRUCTURE; DOMAIN; LOCALIZATION; ADAPTATION; GENERATION; STABILITY;
DOI
10.1002/mlf2.12125
Chinese Library Classification number
Q93 [Microbiology];
Discipline classification codes
071005; 100705;
Abstract
Halophilic proteins possess unique structural properties and show high stability under extreme conditions. This distinct characteristic makes them invaluable for applications in areas such as bioenergy, pharmaceuticals, environmental clean-up, and energy production. Generally, halophilic proteins are discovered and characterized through labor-intensive and time-consuming wet lab experiments. In this study, we introduce the Halophilic Protein Classifier (HPClas), a machine learning-based classifier developed using the catBoost ensemble learning technique to identify halophilic proteins. Extensive in silico calculations were conducted on a large public dataset of 12,574 samples, and HPClas achieved an area under the receiver operating characteristic curve (AUROC) of 0.844 on an independent test set of 200 samples. The source code and curated dataset of HPClas are publicly available at . In conclusion, HPClas can be explored as a promising tool to aid in the identification of halophilic proteins and accelerate their application in different fields.
A method based on the prediction of proteins secreted by extreme halophilic bacteria was used to successfully extract a large number of halophilic proteins. Using these data, we have trained an accurate halophilic protein classifier that can determine whether an input protein is halophilic with a high accuracy of 84.5%. This work could not only promote the exploration and mining of halophilic proteins in nature but also provide guidance for the generation of mutant halophilic enzymes.
Pages: 515-526
Page count: 12