Robust structured heterogeneity analysis approach for high-dimensional data

Cited by: 3
Authors
Sun, Yifan [1, 2]
Luo, Ziye [2]
Fan, Xinyan [1, 2]
Affiliations
[1] Renmin Univ China, Ctr Appl Stat, 59 Zhongguancun St, Beijing 100872, Peoples R China
[2] Renmin Univ China, Sch Stat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
high-dimensional data; overlapping clusters; robustness; subgroup identification; DIVERGING NUMBER; FINITE MIXTURE; REGRESSION; SELECTION; QM;
DOI
10.1002/sim.9414
Chinese Library Classification
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Revealing relationships between genes and disease phenotypes is a critical problem in biomedical studies, and it is complicated by the heterogeneity of diseases. Patients with what is perceived as the same disease may form multiple subgroups, and different subgroups have distinct sets of important genes. It is hence imperative to discover the latent subgroups and reveal the subgroup-specific important genes. Several heterogeneity analysis methods have been proposed in the recent literature. Despite considerable successes, most existing methods remain limited because they cannot accommodate data contamination and they ignore the interconnections among genes. To address these shortcomings, we develop a robust structured heterogeneity analysis approach that identifies subgroups, selects important genes, and estimates their effects on the phenotype of interest. Possible data contamination is accommodated by employing the Huber loss function. A sparse overlapping group lasso penalty is imposed for regularized estimation and gene identification, while taking into account the possibly overlapping cluster structure of genes. The approach uses an iterative strategy similar in spirit to K-means clustering. Simulations demonstrate that the proposed approach outperforms alternatives in revealing the heterogeneity and selecting important genes for each subgroup. The analysis of Cancer Cell Line Encyclopedia data leads to biologically meaningful findings with improved prediction and grouping stability.
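The two ingredients named in the abstract, a Huber loss and a K-means-like alternation between assigning subjects to subgroups and refitting each subgroup's model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tuning constant delta = 1.345, the IRLS solver, and the random initialization are all assumptions made for illustration, and a small ridge term stands in for the sparse overlapping group lasso penalty, which is not reproduced here.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond that."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def fit_huber(X, y, delta=1.345, ridge=1e-3, n_iter=50):
    """Huber regression by iteratively reweighted least squares.

    Assumption: the ridge term is only a placeholder for the paper's
    sparse overlapping group lasso penalty.
    """
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        a = np.maximum(np.abs(r), 1e-12)
        # Weight 1 inside the quadratic zone, delta/|r| outside it.
        w = np.where(a <= delta, 1.0, delta / a)
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X + ridge * np.eye(p), XtW @ y)
    return beta

def cluster_regress(X, y, k=2, n_rounds=20, seed=0):
    """Alternate between refitting each subgroup and reassigning subjects."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(k, size=n)  # hypothetical random start
    betas = np.zeros((k, p))
    for _ in range(n_rounds):
        for g in range(k):
            mask = labels == g
            if mask.sum() < p:  # reseed a subgroup that became too small
                mask = rng.random(n) < 0.5
            betas[g] = fit_huber(X[mask], y[mask])
        # Assign each subject to the subgroup whose fit gives the smallest loss.
        losses = huber_loss(y[:, None] - X @ betas.T)
        new_labels = losses.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, betas
```

As with K-means, a scheme of this kind can stop at a local optimum, so in practice one would typically rerun it from several starting assignments and keep the solution with the smallest total loss.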
Pages: 3229-3259
Page count: 31