REMI: REGRESSION WITH MARGINAL INFORMATION AND ITS APPLICATION IN GENOME-WIDE ASSOCIATION STUDIES

Authors
Huang, Jian [1 ,2 ]
Jiao, Yuling [1 ,2 ]
Liu, Jin [1 ,2 ]
Yang, Can [1 ,2 ]
Affiliations
[1] Univ Iowa, Duke NUS Med Sch, Zhongnan Univ Econ & Law, Iowa City, IA 52242 USA
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Keywords
Genome-wide association studies; high dimensional regression; marginal information; polygenic risk score; VARIABLE SELECTION; GENETIC ARCHITECTURE; CAUSAL VARIANTS; STATISTICS; COMMON; REGULARIZATION; HERITABILITY; LASSO; LOCI;
DOI
10.5705/ss.202019.0182
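Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes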
020208 ; 070103 ; 0714 ;
Abstract
We consider the problem of variable selection and estimation in high-dimensional linear regression models when complete data are not accessible, but certain marginal information or summary statistics are available. This problem is motivated by genome-wide association studies (GWASs) with millions of genotyped single nucleotide polymorphisms (SNPs), which have been widely used to identify risk variants for complex human traits and diseases. With the large number of completed GWASs, statistical methods using summary statistics have become increasingly important because of the inaccessibility of individual-level data. In this study, we propose the regression with marginal information (REMI) method, an $\ell_1$-penalized approach that uses estimated marginal effects together with a covariance matrix of the predictors estimated from external reference samples. The proposed method is highly scalable and capable of analyzing multiple GWAS data sets from hundreds of thousands of individuals and a large number of SNPs. We also establish an upper bound on the error of the REMI estimator, which has the same order as that of the minimax error bound of the Lasso with complete individual-level data. We conduct simulation studies to evaluate the performance of the proposed method. An interesting finding is that when a large number of marginal estimates is available together with a small number of reference samples, as in a GWAS, the proposed method yields good estimation and prediction results, outperforming the Lasso applied to complete data with a relatively small sample size. We apply the proposed method to GWAS data for 10 traits from the Northern Finland Birth Cohorts program. In particular, the real-data analysis results indicate that a summary-level-based analysis using the REMI outperforms an individual-level-based analysis when the sample size of the summary-level data is larger than that of the individual-level data.
In summary, our theoretical and real-data results provide solid support for a summary-level-based analysis. As a result, polygenic risk scores of a wide variety of complex diseases can be obtained using summary statistics with theoretically guaranteed performance. The developed R package and the code to reproduce the results are available at https://github.com/gordonliu810822/REMI.
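To make the idea in the abstract concrete, the following is a minimal sketch of an $\ell_1$-penalized regression that uses only marginal effects and a reference covariance matrix. It assumes the objective takes the form $\tfrac{1}{2}\beta^\top \hat\Sigma \beta - \hat r^\top \beta + \lambda \|\beta\|_1$, where $\hat r$ holds the marginal estimates and $\hat\Sigma$ is estimated from reference samples. The objective form, the coordinate-descent solver, and all names (`soft_threshold`, `l1_summary_regression`, `sigma_hat`, `r_hat`, `lam`) are illustrative assumptions and are not taken from the REMI R package; the demonstration data are simulated.

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def l1_summary_regression(sigma_hat, r_hat, lam, n_iter=200, tol=1e-8):
    """Coordinate descent for
        min_b  0.5 * b' sigma_hat b - r_hat' b + lam * ||b||_1.

    sigma_hat : (p, p) covariance estimate (assumed: from reference samples)
    r_hat     : (p,) marginal effect estimates (assumed: X_j' y / n)
    Requires strictly positive diagonal entries in sigma_hat.
    """
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    p = r_hat.shape[0]
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # r_j minus the contribution of all other coordinates
            rho = r_hat[j] - sigma_hat[j] @ beta + sigma_hat[j, j] * beta[j]
            new_bj = soft_threshold(rho, lam) / sigma_hat[j, j]
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta


if __name__ == "__main__":
    # Simulated illustration only: a large "GWAS" sample supplies marginal
    # effects, and a small, separate reference sample supplies sigma_hat.
    rng = np.random.default_rng(0)
    n, n_ref, p = 2000, 300, 20
    beta_true = np.zeros(p)
    beta_true[:3] = [1.0, -0.8, 0.5]

    X = rng.standard_normal((n, p))
    y = X @ beta_true + rng.standard_normal(n)
    r_hat = X.T @ y / n                      # marginal information

    X_ref = rng.standard_normal((n_ref, p))  # external reference samples
    sigma_hat = X_ref.T @ X_ref / n_ref

    beta_hat = l1_summary_regression(sigma_hat, r_hat, lam=0.1)
    print("selected predictors:", np.flatnonzero(beta_hat))
```

Only `r_hat` and `sigma_hat` enter the solver, so the individual-level pairs `(X, y)` are never needed once the summary statistics have been computed. This is the property that makes a summary-level analysis possible in this sketch.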
Pages: 1985-2004
Page count: 20
Related papers
50 in total
  • [1] Robustification of Linear Regression and Its Application in Genome-Wide Association Studies
    Alamin, Md
    Sultana, Most Humaira
    Xu, Haiming
    Mollah, Md Nurul Haque
    FRONTIERS IN GENETICS, 2020, 11
  • [2] Deshrinking ridge regression for genome-wide association studies
    Wang, Meiyue
    Li, Ruidong
    Xu, Shizhong
    BIOINFORMATICS, 2020, 36 (14) : 4154 - 4162
  • [3] Regularized regression method for genome-wide association studies
    Jin Liu
    Kai Wang
    Shuangge Ma
    Jian Huang
    BMC Proceedings, 5 (Suppl 9)
  • [4] Mixed logistic regression in genome-wide association studies
    Jacqueline Milet
    David Courtin
    André Garcia
    Hervé Perdry
    BMC Bioinformatics, 21
  • [5] Mixed logistic regression in genome-wide association studies
    Milet, Jacqueline
    Courtin, David
    Garcia, Andre
    Perdry, Herve
    BMC BIOINFORMATICS, 2020, 21 (01)
  • [6] Absolute Fused Lasso and Its Application to Genome-Wide Association Studies
    Yang, Tao
    Liu, Jun
    Gong, Pinghua
    Zhang, Ruiwen
    Shen, Xiaotong
    Ye, Jieping
    KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016, : 1955 - 1964
  • [7] Software for Genome-Wide Association Studies in Autopolyploids and Its Application to Potato
    Rosyara, Umesh R.
    De Jong, Walter S.
    Douches, David S.
    Endelman, Jeffrey B.
    PLANT GENOME, 2016, 9 (02):
  • [8] Empirical Saddlepoint Approximation and Its Application to Genome-Wide Association Studies
    Ma, Yuzhuo
    Bi, Wenjian
    Zhang, Ji-Feng
    2021 PROCEEDINGS OF THE 40TH CHINESE CONTROL CONFERENCE (CCC), 2021, : 6380 - 6385
  • [9] Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data
    Shin S.
    Keleş S.
    Statistics in Biosciences, 2017, 9 (1) : 50 - 72
  • [10] Optimal use of regression models in genome-wide association studies
    Powell, J. E.
    Kranis, A.
    Floyd, J.
    Dekkers, J. C. M.
    Knott, S.
    Haley, C. S.
    ANIMAL GENETICS, 2012, 43 (02) : 133 - 143