Robust structured heterogeneity analysis approach for high-dimensional data

Cited by: 3

Authors:

Sun, Yifan [1,2]

Luo, Ziye [2]

Fan, Xinyan [1,2]

Affiliations:

[1] Renmin Univ China, Ctr Appl Stat, 59 Zhongguancun St, Beijing 100872, Peoples R China

[2] Renmin Univ China, Sch Stat, Beijing, Peoples R China

Source:

STATISTICS IN MEDICINE | 2022 / Vol. 41 / Issue 17

Funding:

National Natural Science Foundation of China;

Keywords:

high-dimensional data; overlapping clusters; robustness; subgroup identification; DIVERGING NUMBER; FINITE MIXTURE; REGRESSION; SELECTION; QM;

DOI:

10.1002/sim.9414

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

Revealing relationships between genes and disease phenotypes is a critical problem in biomedical studies. This problem has been challenged by the heterogeneity of diseases. Patients of a perceived same disease may form multiple subgroups, and different subgroups have distinct sets of important genes. It is hence imperative to discover the latent subgroups and reveal the subgroup-specific important genes. Some heterogeneity analysis methods have been proposed in the recent literature. Despite considerable successes, most of the existing studies are still limited as they cannot accommodate data contamination and ignore the interconnections among genes. Aiming at these shortages, we develop a robust structured heterogeneity analysis approach to identify subgroups, select important genes as well as estimate their effects on the phenotype of interest. Possible data contamination is accommodated by employing the Huber loss function. A sparse overlapping group lasso penalty is imposed to conduct regularization estimation and gene identification, while taking into account the possibly overlapping cluster structure of genes. This approach takes an iterative strategy in the similar spirit of K-means clustering. Simulations demonstrate that the proposed approach outperforms alternatives in revealing the heterogeneity and selecting important genes for each subgroup. The analysis of Cancer Cell Line Encyclopedia data leads to biologically meaningful findings with improved prediction and grouping stability.
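The two ingredients named in the abstract, the Huber loss and a K-means-style alternation between fitting and regrouping, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it omits the sparse overlapping group lasso penalty entirely and uses plain gradient descent as a stand-in for the penalized per-subgroup fit. The function names, the threshold `delta=1.345`, the step size, and the iteration counts are all assumptions chosen for illustration.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond (delta is an assumed default)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss with respect to the residual (a clipped residual)."""
    return np.clip(r, -delta, delta)

def fit_huber(X, y, beta, n_steps=200, lr=0.1, delta=1.345):
    """Unpenalized Huber regression by gradient descent.
    Stand-in for the penalized fit; no group lasso term is applied here."""
    n = len(y)
    for _ in range(n_steps):
        r = y - X @ beta
        beta = beta + lr * X.T @ huber_grad(r, delta) / n
    return beta

def cluster_regress(X, y, K=2, n_iter=20, seed=0):
    """Alternate between (1) fitting one coefficient vector per subgroup and
    (2) reassigning each sample to the subgroup with the smallest Huber loss."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(y))      # random initial grouping
    betas = np.zeros((K, X.shape[1]))
    for _ in range(n_iter):
        for k in range(K):
            mask = labels == k
            if mask.any():                     # skip empty subgroups
                betas[k] = fit_huber(X[mask], y[mask], betas[k])
        losses = np.stack([huber_loss(y - X @ b) for b in betas], axis=1)
        new_labels = losses.argmin(axis=1)
        if np.array_equal(new_labels, labels): # grouping is stable: stop
            break
        labels = new_labels
    return betas, labels
```

Calling `cluster_regress(X, y, K=2)` on an `(n, p)` design matrix and a length-`n` response returns a `(K, p)` array of coefficients and a length-`n` array of subgroup labels. Because the Huber gradient clips large residuals, a few contaminated responses pull the fit less than they would under squared-error loss.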

Pages: 3229-3259

Page count: 31

50 records in total

[11] A robust variable screening method for high-dimensional data
Wang, Tao
Zheng, Lin
Li, Zhonghua
Liu, Haiyang
JOURNAL OF APPLIED STATISTICS, 2017, 44 (10) : 1839 - 1855
[12] Robust high-dimensional regression for data with anomalous responses
Ren, Mingyang
Zhang, Sanguo
Zhang, Qingzhao
ANNALS OF THE INSTITUTE OF STATISTICAL MATHEMATICS, 2021, 73 (04) : 703 - 736
[13] Robust linear regression for high-dimensional data: An overview
Filzmoser, Peter
Nordhausen, Klaus
WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2021, 13 (04)
[14] ON ROBUST INFORMATION EXTRACTION FROM HIGH-DIMENSIONAL DATA
Kalina, Jan
SERBIAN JOURNAL OF MANAGEMENT, 2014, 9 (01) : 131 - 144
[15] Robust feature screening for high-dimensional survival data
Hao, Meiling
Lin, Yuanyuan
Liu, Xianhui
Tang, Wenlu
JOURNAL OF APPLIED STATISTICS, 2019, 46 (06) : 979 - 994
[16] Cauchy robust principal component analysis with applications to high-dimensional data sets
Fayomi, Aisha
Pantazis, Yannis
Tsagris, Michail
Wood, Andrew T. A.
STATISTICS AND COMPUTING, 2024, 34 (01)
[17] Robust Latent Factor Analysis for Precise Representation of High-Dimensional and Sparse Data
Wu, Di
Luo, Xin
IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2021, 8 (04) : 796 - 805
[18] Robust Latent Factor Analysis for Precise Representation of High-Dimensional and Sparse Data
Di Wu
Xin Luo
IEEE/CAA Journal of Automatica Sinica, 2021, 8 (04) : 796 - 805
[19] Cauchy robust principal component analysis with applications to high-dimensional data sets
Aisha Fayomi
Yannis Pantazis
Michail Tsagris
Andrew T. A. Wood
Statistics and Computing, 2024, 34
[20] Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data
Serra, Angela
Coretto, Pietro
Fratello, Michele
Tagliaferri, Roberto
BIOINFORMATICS, 2018, 34 (04) : 625 - 634
