A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods

Cited by: 12

Authors:

Panken, A. M. [1,2]

Heymans, M. W. [1]

Affiliations:

[1] Vrije Univ Amsterdam, Amsterdam Publ Hlth Res Inst, Amsterdam UMC, Dept Epidemiol & Data Sci, Amsterdam, Netherlands

[2] Phys Therapy Practice Panken, Roermond, Netherlands

Source:

BMC MEDICAL RESEARCH METHODOLOGY | 2022 / Vol. 22 / Issue 01

Keywords:

Logistic regression; Median-p-rule; Multiple imputation; Pooling selection methods; Variable selection; IMPUTATION; VALUES;

DOI:

10.1186/s12874-022-01693-8

Chinese Library Classification (CLC) number:

R19 [Health care organization and services (health services administration)];

Subject classification code:

Abstract:

Background: For the development of prognostic models after multiple imputation, it is advised to apply variable selection to the pooled model. The aim of this study is to evaluate, using a simulation study and a practical data example, the performance of four different pooling methods for variable selection in multiply imputed datasets. These methods are D1, D2, D3 and the recently extended Median-P-Rule (MPR) for categorical, dichotomous, and continuous variables in logistic regression models.

Methods: Four datasets (n = 200 and n = 500), with 9 variables and correlations of 0.2 and 0.6, respectively, between these variables, were simulated. These datasets included 2 categorical and 2 continuous variables with 20% missing at random data. Multiple imputation (m = 5) was applied, and the four methods were compared with selection from the full model (without missing data). The same analyses were repeated in five multiply imputed real-world datasets (NHANES) (m = 5, p = 0.05, N = 250/300/400/500/1000).

Results: In the simulated datasets, the differences between the pooling methods were most evident in the smaller datasets. The MPR performed equally to all other pooling methods for the selection frequency, as well as for the P-values of the continuous and dichotomous variables; however, the MPR performed consistently better for pooling and selecting categorical variables in multiply imputed datasets, and also regarding the stability of the selected prognostic models. Analyses in the NHANES dataset showed that all methods mostly selected the same models. Compared to each other, however, the D2 method seemed to be the least sensitive and the MPR the most sensitive, as well as the simplest and easiest method to apply.

Conclusions: Considering that MPR is the simplest and easiest pooling method to use for epidemiologists and applied researchers, we cautiously recommend using the MPR method to pool categorical variables with more than two levels after multiple imputation in combination with backward selection (BWS) procedures. Because MPR never performed worse than the other methods for continuous and dichotomous variables, we also advise using MPR for these types of variables.
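
Note: The pooling step behind the Median-P-Rule can be illustrated with a minimal Python sketch. It assumes that each variable already has one p-value per imputed dataset (for a categorical variable, an overall test across its levels) and that the pooled p-value is the median of those values, as the method's name indicates. The function name `median_p_rule`, the variable names, and the p-values are hypothetical and are not taken from the article.

```python
from statistics import median


def median_p_rule(p_values_by_variable, alpha=0.05):
    """Pool per-dataset p-values by their median and flag variables below alpha.

    p_values_by_variable maps a variable name to the list of p-values
    obtained for that variable in each of the m imputed datasets.
    Returns a dict: name -> (median p-value, True if median < alpha).
    """
    pooled = {}
    for name, p_values in p_values_by_variable.items():
        p_med = median(p_values)
        pooled[name] = (p_med, p_med < alpha)
    return pooled


# Hypothetical p-values from m = 5 imputed datasets (illustrative only).
example = {
    "age": [0.01, 0.03, 0.02, 0.04, 0.02],
    "smoking_category": [0.04, 0.09, 0.06, 0.03, 0.08],
}
result = median_p_rule(example)
for name, (p_med, keep) in result.items():
    print(name, p_med, "retain" if keep else "candidate for removal")
```

In a backward selection procedure, a step like this would be repeated after each refit, removing the variable with the largest pooled p-value above the threshold; that loop is omitted here.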


Pages: 11
