Nearest neighbor imputation algorithms: a critical evaluation

Cited by: 437
Authors
Beretta, Lorenzo [1]
Santaniello, Alessandro [1]
Affiliations
[1] Fdn IRCCS Ca Granda Osped Maggiore Policlin, Referral Ctr System Autoimmune Dis, Milan, Italy
Keywords
MULTIPLE IMPUTATION
DOI
10.1186/s12911-016-0318-z
Chinese Library Classification (CLC) number
R-058
Discipline classification code
Abstract
Background: Nearest neighbor (NN) imputation algorithms are efficient methods for filling in missing data: each missing value in a record is replaced by a value obtained from related cases in the whole set of records. Besides substituting missing data with plausible values that are as close as possible to the true values, imputation algorithms should preserve the original data structure and avoid distorting the distribution of the imputed variable. Despite the efficiency of NN algorithms, little is known about the effect of these methods on data structure.

Methods: Simulations on synthetic datasets with different patterns and degrees of missingness were conducted to evaluate the performance of NN with a single neighbor (1NN) and with k neighbors, either unweighted (kNN) or weighted (wkNN), in the context of different learning frameworks: plain set, reduced set after ReliefF filtering, bagging, random choice of attributes, and bagging combined with random choice of attributes (a Random-Forest-like method).

Results: Whatever the framework, kNN usually outperformed 1NN in terms of precision of imputation and reduced errors in inferential statistics. However, 1NN was the only method capable of preserving the data structure; data were distorted even when small values of k were considered, and distortion was more severe for resampling schemas.

Conclusions: The use of three neighbors in conjunction with ReliefF seems to provide the best trade-off between imputation error and preservation of the data structure. The very same conclusions can be drawn from imputation experiments conducted on the single photon emission computed tomography (SPECTF) heart dataset after the introduction of missing data completely at random.
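The three variants compared in the abstract (1NN, kNN and wkNN) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_impute`, the use of a Euclidean-type distance restricted to attributes observed in both records, and inverse-distance weights for the weighted variant are all assumptions made for illustration only.

```python
import numpy as np

def knn_impute(X, k=3, weighted=False):
    """Fill NaNs using the k nearest donor rows (hypothetical helper, illustrative only)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.where(missing)):
        # Donors are the rows in which attribute j is observed.
        donors = np.where(~missing[:, j])[0]
        # Assumption: compare records only on attributes observed in both.
        shared = ~missing[i] & ~missing[donors]
        diff = np.where(shared, X[donors] - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        usable = n_shared > 0
        donors, diff, n_shared = donors[usable], diff[usable], n_shared[usable]
        if donors.size == 0:
            continue  # no comparable donor: leave the value missing
        # Root mean squared difference over the shared attributes.
        dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        nearest = np.argsort(dist, kind="stable")[:k]
        values = X[donors[nearest], j]
        if weighted:
            # Assumption: inverse-distance weights for the wkNN variant.
            w = 1.0 / (dist[nearest] + 1e-12)
            out[i, j] = np.sum(w * values) / np.sum(w)
        else:
            out[i, j] = values.mean()
    return out
```

Under these assumptions, `k=1` corresponds to 1NN (the donor's value is copied unchanged, so no new values enter the imputed variable), whereas `k>1` averages several donors, which tends to pull imputed values toward the centre of the distribution.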
Pages: 12