Tree-based prediction on incomplete data using imputation or surrogate decisions

Cited by: 54
Authors
Valdiviezo, H. Cevallos [1]
Van Aelst, S. [1, 2]
Affiliations
[1] Univ Ghent, Dept Appl Math Comp Sci & Stat, B-9000 Ghent, Belgium
[2] Katholieke Univ Leuven, Dept Math, Sect Stat, B-3001 Louvain, Belgium
Keywords
Prediction; Missing data; Surrogate decision; Multiple imputation; Conditional inference tree; MICE
DOI
10.1016/j.ins.2015.03.018
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The goal is to investigate the prediction performance of tree-based techniques when the available training data contain features with missing values. Future test cases may also contain missing values, so the methods should be able to generate predictions for such cases. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Missing values generated according to missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) mechanisms are considered with various fractions of missing data. Imputation models are built in the learning phase and do not make use of the response variable, so that the resulting procedures can predict individual incomplete test cases. In the empirical comparison, both classification and regression problems are considered using simulated and real-life datasets. The performance is evaluated by the misclassification rate of predictions and the mean squared prediction error, respectively. Overall, our results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles. © 2015 Elsevier Inc. All rights reserved.
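The multiple-imputation ensemble idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: regression stumps stand in for conditional inference trees, a stochastic linear-regression imputer stands in for MICE, and all function names and the toy data (`fit_imputer`, `fit_mi_ensemble`, and so on) are hypothetical. The one property it keeps from the abstract is that the imputation model is fitted on training features only, never on the response, so an incomplete test case can be imputed and predicted on its own.

```python
import random
import statistics

def fit_imputer(rows):
    """Fit x2 ~ a + b*x1 on rows with observed x2. The response is never used."""
    obs = [(x1, x2) for x1, x2 in rows if x2 is not None]
    mx = statistics.fmean([x for x, _ in obs])
    mv = statistics.fmean([v for _, v in obs])
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    b = sum((x - mx) * (v - mv) for x, v in obs) / sxx
    a = mv - b * mx
    resid_sd = statistics.pstdev([v - (a + b * x) for x, v in obs])
    return a, b, resid_sd

def impute(row, imputer, rng):
    """Fill a missing x2 with a random draw around the regression line."""
    x1, x2 = row
    if x2 is None:
        a, b, sd = imputer
        x2 = a + b * x1 + rng.gauss(0.0, sd)
    return (x1, x2)

def fit_stump(X, y):
    """Regression stump (assumed stand-in for a tree): best single split by squared error."""
    best = None
    for j in range(2):
        values = sorted(set(r[j] for r in X))
        for t in values[:-1]:  # split "<= t" vs "> t"; both sides are non-empty
            left = [yi for r, yi in zip(X, y) if r[j] <= t]
            right = [yi for r, yi in zip(X, y) if r[j] > t]
            ml, mr = statistics.fmean(left), statistics.fmean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_mi_ensemble(X_train, y_train, m=5, seed=0):
    """Create m imputed copies of the training features and fit one stump per copy."""
    rng = random.Random(seed)
    imputer = fit_imputer(X_train)
    stumps = []
    for _ in range(m):
        completed = [impute(r, imputer, rng) for r in X_train]
        stumps.append(fit_stump(completed, y_train))
    return imputer, stumps

def predict_mi_ensemble(model, row, seed=1):
    """Impute the (possibly incomplete) test case once per stump and average the predictions."""
    imputer, stumps = model
    rng = random.Random(seed)
    return statistics.fmean([predict_stump(s, impute(row, imputer, rng)) for s in stumps])

# Toy regression data (made up for illustration) with about 20% MCAR missingness in x2.
data_rng = random.Random(42)
X_train, y_train = [], []
for _ in range(200):
    x1 = data_rng.gauss(0.0, 1.0)
    x2 = x1 + data_rng.gauss(0.0, 0.5)
    y = (2.0 if x2 > 0 else -2.0) + data_rng.gauss(0.0, 0.1)
    if data_rng.random() < 0.2:
        x2 = None  # value removed completely at random
    X_train.append((x1, x2))
    y_train.append(y)

model = fit_mi_ensemble(X_train, y_train, m=5)
pred = predict_mi_ensemble(model, (1.5, None))  # test case with x2 missing
print(pred)
```

Averaging over several imputed copies lets the ensemble reflect the uncertainty about the missing values, which is the intuition behind combining multiple imputation with tree ensembles; a single deterministic fill-in would correspond to the single-imputation variants mentioned in the abstract.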
Pages: 163-181
Number of pages: 19