A comparison of model selection methods for prediction in the presence of multiply imputed data

Cited by: 31
Authors
Le Thi Phuong Thao [1]
Geskus, Ronald [1, 2]
Affiliations
[1] Univ Oxford, Biostat Grp, Clin Res Unit, Ho Chi Minh City, Vietnam
[2] Univ Oxford, Nuffield Dept Med, Oxford, England
Funding
Wellcome Trust (UK)
Keywords
lasso; multiply imputed data; prediction; stacked data; variable selection
DOI
10.1002/bimj.201700232
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Many approaches have been proposed for variable selection with multiply imputed (MI) data in the development of a prognostic model. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or lasso, with the final model fitted on the variables selected most frequently (e.g. >= 50%) over all MI and bootstrap data sets; (II) model selection on the original MI data, using lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set, (ii) averaging estimates of variables that were selected in 50% of the MI data sets, (iii) performing lasso on the stacked MI data, or (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty, by refitting either the linear predictor or all individual variables. We applied the methods to a real data set of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty shows the best performance in both approaches I and II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
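As an illustration of the combining rules (i) and (ii) described in the abstract, the following minimal sketch takes lasso coefficients that have already been estimated on each imputed data set and keeps the variables whose selection frequency reaches a chosen fraction. It is not the authors' implementation. The function name `combine_selections`, the variable names, and the coefficient values are hypothetical. Two further assumptions are made: the 50% rule is read as "at least half of the imputed data sets", and the averaged estimate counts a non-selected variable as zero.

```python
from typing import Dict, List


def combine_selections(
    fits: List[Dict[str, float]], min_fraction: float
) -> Dict[str, float]:
    """Combine per-imputation lasso fits into one coefficient set.

    fits: one dict per imputed data set, mapping variable name to its
          estimated coefficient (0.0 or absent means "not selected").
    min_fraction: keep a variable if it was selected in at least this
          fraction of the imputed data sets; 0.0 means "selected in any".
    """
    m = len(fits)
    variables = sorted({name for fit in fits for name in fit})
    combined = {}
    for name in variables:
        values = [fit.get(name, 0.0) for fit in fits]
        frequency = sum(1 for b in values if b != 0.0) / m
        # A variable never selected is always dropped, even if min_fraction is 0.
        if frequency > 0 and frequency >= min_fraction:
            # Assumption: average over all m fits, zeros included.
            combined[name] = sum(values) / m
    return combined


# Hypothetical coefficients from four imputed data sets.
fits = [
    {"age": 0.8, "gcs": -0.5, "hiv": 0.0},
    {"age": 0.6, "gcs": -0.7, "hiv": 0.3},
    {"age": 1.0, "gcs": 0.0, "hiv": 0.0},
    {"age": 0.8, "gcs": -0.6, "hiv": 0.0},
]

any_rule = combine_selections(fits, min_fraction=0.0)   # rule (i)
half_rule = combine_selections(fits, min_fraction=0.5)  # rule (ii)
print(sorted(any_rule))
print(sorted(half_rule))
```

With these made-up inputs, rule (i) retains all three variables, while rule (ii) drops the one selected in only one of four imputed data sets. The need to pick `min_fraction` is the selection threshold that the stacked approach avoids.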
Pages: 343-356
Number of pages: 14
Related papers
50 in total
  • [1] Order selection tests with multiply imputed data
    Consentino, Fabrizio
    Claeskens, Gerda
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2010, 54 (10) : 2284 - 2295
  • [2] Variable selection and prediction of clinical outcome with multiply-imputed data via Bayesian model averaging
    Jiang, Guozhi
    Tam, Claudia H. T.
    Luk, Andrea O. Y.
    Kong, Alice P. S.
    So, Wing Yee
    Chan, Juliana C. N.
    Ma, Ronald C. W.
    Fan, Xiaodan
    2016 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2016, : 727 - 730
  • [3] Model selection of generalized estimating equations with multiply imputed longitudinal data
    Shen, Chung-Wei
    Chen, Yi-Hau
    BIOMETRICAL JOURNAL, 2013, 55 (06) : 899 - 911
  • [4] How should variable selection be performed with multiply imputed data?
    Wood, Angela M.
    White, Ian R.
    Royston, Patrick
    STATISTICS IN MEDICINE, 2008, 27 (17) : 3227 - 3246
  • [5] Validation of prediction models based on lasso regression with multiply imputed data
    Musoro, Jammbe Z.
    Zwinderman, Aeilko H.
    Puhan, Milo A.
    ter Riet, Gerben
    Geskus, Ronald B.
    BMC MEDICAL RESEARCH METHODOLOGY, 2014, 14
  • [6] Validation of prediction models based on lasso regression with multiply imputed data
    Jammbe Z Musoro
    Aeilko H Zwinderman
    Milo A Puhan
    Gerben ter Riet
    Ronald B Geskus
    BMC Medical Research Methodology, 14
  • [7] Pooling Methods for Likelihood Ratio Tests in Multiply Imputed Data Sets
    Grund, Simon
    Ludtke, Oliver
    Robitzsch, Alexander
    PSYCHOLOGICAL METHODS, 2023, 28 (05) : 1207 - 1221
  • [8] Power calculation in multiply imputed data
    Ruochen Zha
    Ofer Harel
    Statistical Papers, 2021, 62 : 533 - 559
  • [9] MULTIPLY ROBUST BOOTSTRAP VARIANCE ESTIMATION IN THE PRESENCE OF SINGLY IMPUTED SURVEY DATA
    Chen, Sixia
    Haziza, David
    Mashreghi, Zeinab
    JOURNAL OF SURVEY STATISTICS AND METHODOLOGY, 2021, 9 (04) : 810 - 832
  • [10] Variable selection for multiply-imputed data with application to dioxin exposure study
    Chen, Qixuan
    Wang, Sijian
    STATISTICS IN MEDICINE, 2013, 32 (21) : 3646 - 3659