Annotation of the Giardia proteome through structure-based homology and machine learning

Cited by: 18
Authors
Ansell, Brendan R. E. [1]
Pope, Bernard J. [2,3,4,5]
Georgeson, Peter [2,3,4]
Emery-Corbin, Samantha J. [1]
Jex, Aaron R. [1,6]
Affiliations
[1] Walter & Eliza Hall Inst Med Res, Populat Hlth & Immun Div, 1G Royal Pde, Parkville, Vic 3052, Australia
[2] Univ Melbourne, Melbourne Bioinformat, 187 Grattan St, Melbourne, Vic 3010, Australia
[3] Victorian Comprehens Canc Ctr, Ctr Canc Res, 305 Grattan St, Melbourne, Vic 3000, Australia
[4] Univ Melbourne, Dept Clin Pathol, 305 Grattan St, Melbourne, Vic 3000, Australia
[5] Monash Univ, Dept Med, Cent Clin Sch, 99 Commercial Rd, Melbourne, Vic 3004, Australia
[6] Univ Melbourne, Fac Vet & Agr Sci, Cnr Pk Dr & Flemington Rd, Melbourne, Vic 3010, Australia
Source
GIGASCIENCE | 2019 / Volume 8 / Issue 01
Funding
Australian Research Council; UK Medical Research Council
Keywords
Giardia duodenalis; structural homology; I-TASSER; random forest; functional prediction; machine learning; prioritization; parasite; protist; resistance mechanisms; structure prediction
DOI
10.1093/gigascience/giy150
Chinese Library Classification (CLC) number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
Background: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination, with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures.
Aims: Our aim was to predict the proteome of a neglected human pathogen, Giardia duodenalis, and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination.
Methods: We used the I-TASSER suite to predict structural models for approximately 5,000 proteins encoded in G. duodenalis and identify their closest empirically determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching protein family (Pfam) domains in query and reference peptides. Metrics output from the suite and derived metrics were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier.
Results: We identified 1,095 high-confidence models including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions. High-confidence models exhibited greater transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high-confidence-like proteins yielded substantial new insight into mechanisms of redox balance in G. duodenalis, a system central to the efficacy of limited anti-giardial drugs.
Conclusion: Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms, including human pathogens, and stratify predicted structures to promote efficient allocation of limited resources for experimental investigation.
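The labelling and single-metric evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy records, and the identity values are all hypothetical, and the Pfam accessions serve only as opaque placeholder labels. The sketch assigns a model to the high-confidence category when query and reference share at least one Pfam domain, then scores how well one metric (here, amino acid identity) separates the two categories using the area under the ROC curve, computed from pairwise comparisons.

```python
def label_by_pfam(query_domains, reference_domains):
    """Hypothetical labelling rule: high confidence (True) when the query
    and its reference structure share at least one Pfam domain."""
    return bool(set(query_domains) & set(reference_domains))


def roc_auc(scores, labels):
    """Area under the ROC curve as the probability that a randomly chosen
    positive scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy records (invented for illustration):
# (query Pfam domains, reference Pfam domains, amino acid identity)
models = [
    ({"PF00069"}, {"PF00069"}, 0.62),
    ({"PF00070"}, {"PF00070", "PF07992"}, 0.48),
    ({"PF00085"}, {"PF00085"}, 0.21),
    (set(), {"PF00012"}, 0.30),
    ({"PF00004"}, {"PF00271"}, 0.12),
    (set(), set(), 0.09),
]

labels = [label_by_pfam(q, r) for q, r, _ in models]
identity = [m[2] for m in models]
identity_auc = roc_auc(identity, labels)
print(f"AUC of amino acid identity as a single predictor: {identity_auc:.3f}")
```

The same `roc_auc` helper could be applied to the class probabilities from a multi-metric classifier such as a random forest, which is how a combined model would be compared against each metric in isolation.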
Pages: 13