Approximate Imputation Method for Missing Data in Machine Learning

Authors
Cao W. [1]
Chu Y. [1]
Li X. [1]
Affiliation
[1] National Key Laboratory of Science and Technology on Blind Signal Processing
Source
Journal of Xi'an Jiaotong University, 51 (2017)
Keywords
Imputation method; Machine learning; Missing attributes; Quadratic programming
DOI
10.7652/xjtuxb201710023
Abstract
An approximate imputation method, k-ANNO, is proposed for handling missing data in machine learning. Given a sample with missing attributes, the method first uses a graph constructed offline to search efficiently for the approximate nearest neighbors of the partially missing sample. A fast quadratic programming algorithm then determines the optimal weight for each neighbor. Finally, the missing attributes are imputed from the corresponding observed attributes of the neighbors, combined using the estimated weights. Users can trade off efficiency against imputation accuracy, and k-ANNO effectively reduces the impact of missing data. Experiments on several well-known datasets show that, when the speedup rate parameter is between 2 and 10, k-ANNO outperforms existing methods such as mean imputation and C-Means imputation: its classification error is 1% to 4% lower and its regression error is 0.5 to 2.0 lower. k-ANNO is also 35% to 320% faster than naive k-NN imputation. © 2017, Editorial Office of Journal of Xi'an Jiaotong University. All rights reserved.
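The three steps in the abstract (neighbor search, weight estimation, weighted imputation) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the offline graph search is replaced by a brute-force scan, the fast quadratic programming solver is replaced by the closed-form solution of an equality-constrained least-squares problem (weights summing to one, with a small ridge term), and the names `impute_sample`, `complete`, `k`, and `reg` are hypothetical.

```python
import numpy as np

def impute_sample(x, complete, k=3, reg=1e-3):
    """Fill the NaN entries of x using its k nearest fully observed rows.

    Illustrative stand-in only: brute-force search replaces the offline
    graph, and a closed-form constrained least-squares solve replaces
    the fast quadratic programming algorithm named in the abstract.
    """
    x = np.asarray(x, dtype=float)
    complete = np.asarray(complete, dtype=float)
    miss = np.isnan(x)   # attributes to impute
    obs = ~miss          # attributes that are present

    # Step 1: nearest neighbors, measured on the observed attributes only.
    dists = np.linalg.norm(complete[:, obs] - x[obs], axis=1)
    idx = np.argsort(dists)[:k]
    neighbors = complete[idx]

    # Step 2: weights w minimizing ||x_obs - w @ neighbors_obs||^2
    # subject to sum(w) == 1. The ridge term keeps the Gram matrix
    # invertible when neighbors are collinear or duplicated.
    diff = neighbors[:, obs] - x[obs]
    gram = diff @ diff.T
    gram += (reg * np.trace(gram) + 1e-12) * np.eye(len(idx))
    w = np.linalg.solve(gram, np.ones(len(idx)))
    w /= w.sum()

    # Step 3: impute missing attributes as the weighted combination
    # of the neighbors' values for those attributes.
    filled = x.copy()
    filled[miss] = w @ neighbors[:, miss]
    return filled, w

# Toy data on the line y = 2x; the second attribute of the query is missing.
data = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [10.0, 20.0]])
filled, weights = impute_sample([1.5, np.nan], data, k=2)
```

In this toy case the two nearest rows lie symmetrically around the query, so the weights are equal and the missing attribute is filled with their average. An approximate neighbor search would change only step 1, which is where the efficiency versus accuracy trade-off mentioned in the abstract would enter.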
Pages: 142-148
Page count: 6