Predicting genome-wide redundancy using machine learning

Cited by: 7
Authors
Chen, Huang-Wen [2]
Bandyopadhyay, Sunayan [2, 3]
Shasha, Dennis E. [2]
Birnbaum, Kenneth D. [1]
Affiliations
[1] NYU, Dept Biol, Ctr Genom & Syst Biol, New York, NY 10003 USA
[2] NYU, Dept Comp Sci, Courant Inst Math Sci, New York, NY 10003 USA
[3] Univ Minnesota Twin Cities, Dept Comp Sci & Engn, Minneapolis, MN 55455 USA
Source
BMC EVOLUTIONARY BIOLOGY | 2010 / Vol. 10
Keywords
GENE-EXPRESSION MAP; ARABIDOPSIS ROOT; SACCHAROMYCES-CEREVISIAE; DUPLICATE GENES; PHENOTYPE; NETWORKS; EVOLUTION; BIOLOGY; BIOINFORMATICS; PRESERVATION;
DOI
10.1186/1471-2148-10-357
Chinese Library Classification
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: Gene duplication can lead to genetic redundancy, which masks the function of mutated genes in genetic analyses. Methods to increase sensitivity in identifying genetic redundancy can improve the efficiency of reverse genetics and lend insights into the evolutionary outcomes of gene duplication. Machine learning techniques are well suited to classifying gene family members into redundant and non-redundant gene pairs in model species where sufficient genetic and genomic data are available, such as Arabidopsis thaliana, the test case used here.
Results: Machine learning techniques that combine multiple attributes led to a dramatic improvement in predicting genetic redundancy over single-attribute classifiers, such as BLAST E-values or expression correlation. In withholding analysis, one of the methods used here, Support Vector Machines, was two-fold more precise than single-attribute classifiers, reaching a level where the majority of redundant calls were correctly labeled. Using this higher confidence in identifying redundancy, machine learning predicts that about half of all genes in Arabidopsis show the signature of predicted redundancy with at least one but typically fewer than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks > 1), suggesting that redundancy is stable over long evolutionary periods.
Conclusions: Machine learning predicts that most genes will have a functionally redundant paralog but will exhibit redundancy with relatively few genes within a family. The predictions and gene pair attributes for Arabidopsis provide a new resource for research in genetics and genome evolution. These techniques can now be applied to other organisms.
Pages: 15