A comparison of model selection methods for prediction in the presence of multiply imputed data

Cited by: 31
Authors
Le Thi Phuong Thao [1]
Geskus, Ronald [1, 2]
Affiliations
[1] Univ Oxford, Biostat Grp, Clin Res Unit, Ho Chi Minh City, Vietnam
[2] Univ Oxford, Nuffield Dept Med, Oxford, England
Funding
Wellcome Trust (UK)
Keywords
lasso; multiply imputed data; prediction; stacked data; variable selection
DOI
10.1002/bimj.201700232
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Many approaches for variable selection with multiply imputed (MI) data in the development of a prognostic model have been proposed. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or lasso, with the final model fitted on the most frequently (e.g. >= 50%) selected variables over all MI and bootstrap data sets; (II) model selection on the original MI data, using lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in 50% of the MI data sets; (iii) performing lasso on the stacked MI data; or (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods to a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, applying lasso selection with the 1-se penalty shows the best performance in both approaches I and II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
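The combination rules (i) and (ii) in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_selected`, the variable names, and the coefficient values are hypothetical. It also assumes that "selected" means a non-zero lasso coefficient, that the 50% rule means "at least 50%", and that a data set in which a variable was not selected contributes a zero estimate to the average; the abstract does not specify these details.

```python
from statistics import mean


def combine_selected(coefs_per_imputation, threshold=0.5):
    """Keep variables whose selection frequency across MI data sets
    meets `threshold`, and average their estimates (hypothetical helper)."""
    m = len(coefs_per_imputation)
    names = sorted({name for coefs in coefs_per_imputation for name in coefs})
    combined = {}
    for name in names:
        # A variable counts as selected when its coefficient is non-zero.
        estimates = [coefs.get(name, 0.0) for coefs in coefs_per_imputation]
        freq = sum(1 for b in estimates if b != 0.0) / m
        if freq > 0 and freq >= threshold:
            # Assumption: fits that dropped the variable contribute zero.
            combined[name] = mean(estimates)
    return combined


# Hypothetical lasso coefficients from M = 4 imputed data sets.
fits = [
    {"age": 0.04, "gcs": -0.30, "hiv": 0.9},
    {"age": 0.05, "gcs": -0.28},
    {"age": 0.03, "gcs": -0.35, "sodium": -0.02},
    {"age": 0.04, "gcs": -0.27, "hiv": 1.1},
]

rule_any = combine_selected(fits, threshold=0.0)   # rule (i): selected in any data set
rule_half = combine_selected(fits, threshold=0.5)  # rule (ii): selected in >= 50%

print(sorted(rule_any))
print(sorted(rule_half))
```

Under rule (i) all four hypothetical variables are retained, while rule (ii) drops the one selected in a single data set. Stacking, as in (iii) and (iv), avoids this threshold choice altogether because a single lasso fit is run on the concatenated imputed data.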
Pages: 343-356
Page count: 14