Separating and reintegrating latent variables to improve classification of genomic data

Cited by: 0
Authors
Payne, Nora Yujia [1]
Gagnon-Bartsch, Johann A. [1]
Affiliations
[1] Univ Michigan, Dept Stat, 1085 S Univ Ave, Ann Arbor, MI 48109 USA
Funding
National Science Foundation (United States)
Keywords
Classification; Gene expression; Linear discriminant analysis; GENE-EXPRESSION; FEATURE-SELECTION; AIR-POLLUTION; METHYLATION; REGRESSION;
DOI
10.1093/biostatistics/kxab046
Chinese Library Classification (CLC) number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
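The pipeline described in the abstract (estimate the latent variation, residualize it out, classify the residuals, then reintegrate the latent variation in an ensemble) can be illustrated with a minimal sketch. This is not the authors' CRC. It assumes the latent variation is estimated by a truncated SVD, uses a nearest-centroid scorer as a stand-in base classifier, combines the two scores by a scaled sum, and omits any cross-fitting or sample splitting. All names (`fit_latent`, `split`, `Centroid`, `fit_ensemble`, `predict`, `simulate`) and all simulation settings are hypothetical.

```python
import numpy as np

def fit_latent(X, k):
    """Estimate k dense latent directions from centered data via SVD (an assumption)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T  # feature means (p,), orthonormal directions (p, k)

def split(X, mu, V):
    """Separate data into latent scores Z and residuals R orthogonal to V."""
    Xc = X - mu
    Z = Xc @ V
    R = Xc - Z @ V.T
    return Z, R

class Centroid:
    """Nearest-centroid scorer for labels in {0, 1}; any base classifier could be used."""
    def fit(self, X, y):
        self.m0 = X[y == 0].mean(axis=0)
        self.m1 = X[y == 1].mean(axis=0)
        return self

    def score(self, X):
        # Positive values mean the sample is closer to the class-1 centroid.
        d0 = ((X - self.m0) ** 2).sum(axis=1)
        d1 = ((X - self.m1) ** 2).sum(axis=1)
        return d0 - d1

def fit_ensemble(X, y, k):
    """Train one scorer on latent scores and one on residuals."""
    mu, V = fit_latent(X, k)
    Z, R = split(X, mu, V)
    clf_z = Centroid().fit(Z, y)
    clf_r = Centroid().fit(R, y)
    # Scale each score by its training standard deviation so neither dominates.
    sz = clf_z.score(Z).std() + 1e-12
    sr = clf_r.score(R).std() + 1e-12
    return {"mu": mu, "V": V, "clf_z": clf_z, "clf_r": clf_r, "sz": sz, "sr": sr}

def predict(model, X):
    """Reintegrate: sum the scaled latent and residual scores, then threshold at 0."""
    Z, R = split(X, model["mu"], model["V"])
    s = model["clf_z"].score(Z) / model["sz"] + model["clf_r"].score(R) / model["sr"]
    return (s > 0).astype(int)

def simulate(n, p, rng):
    """Toy data: one dense latent variable partly correlated with y, plus a sparse signal."""
    y = rng.integers(0, 2, size=n)
    z = rng.normal(size=n) + 1.5 * y      # latent variable, partly predictive
    w = 2.0 * rng.normal(size=p)          # dense loadings on all features
    X = np.outer(z, w) + rng.normal(size=(n, p))
    X[:, :3] += 3.0 * y[:, None]          # sparse effect on three features
    return X, y
```

A typical use would fit on one part of the simulated data and predict on the rest, for example `model = fit_ensemble(X[:200], y[:200], k=1)` followed by `predict(model, X[200:])`. Because the residual scorer here is trained on in-sample residuals, the sketch should be read only as an illustration of the separate-then-reintegrate idea, not as a reproduction of the published procedure or its results.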
Pages: 1133-1149
Page count: 17