Predictive performance of missing data methods for logistic regression, classification trees and neural networks

Cited by: 0
Authors
Schmid, CH [1]
Terrin, N
Griffith, JL
D'Agostino, RB
Selker, HP
Affiliations
[1] Tufts Univ, Medford, MA 02155 USA
[2] Boston Univ, Boston, MA 02215 USA
Keywords
calibration; discrimination; imputation; nonlinear models; simulation; splines
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Although the effect of missing data on regression estimates has received considerable attention, their effect on predictive performance has been neglected. We studied the effect of three missing data strategies (omission of records with missing values, replacement with a mean, and imputation based on regression) on the predictive performance of logistic regression (LR), classification tree (CT) and neural network (NN) models in the presence of data missing completely at random (MCAR). Models were constructed using datasets of size 500 simulated from a joint distribution of binary and continuous predictors including nonlinearities, collinearity and interactions between variables. Though omission produced models that fit better on the data from which the models were developed, imputation was superior on average to omission for all models when evaluating the receiver operating characteristic (ROC) curve area, mean squared error (MSE), pooled variance across outcome categories and calibration chi-square on an independently generated test set. However, in about one-third of simulations, omission performed better. Performance was also more variable with omission, including quite a few instances of extremely poor performance. Replacement and imputation generally produced similar results, except with neural networks, for which replacement, the strategy typically used in neural network algorithms, was inferior to imputation. Missing data affected simpler models much less than they did more complex models, such as generalized additive models, that focus on local structure. For moderate-sized datasets, logistic regressions that use simple nonlinear structures such as quadratic terms and piecewise linear splines appear to be at least as robust to randomly missing values as neural networks and classification trees.
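The three strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 10% missingness rate, the toy data-generating process, and the use of ordinary least squares for "imputation based on regression" are all assumptions made for illustration only.

```python
import numpy as np

def make_mcar(X, rate, rng):
    """Return a copy of X with entries set to NaN completely at random (MCAR)."""
    X_miss = X.astype(float).copy()
    mask = rng.random(X.shape) < rate  # missingness is independent of the data
    X_miss[mask] = np.nan
    return X_miss

def omit_incomplete(X, y):
    """Omission: drop every record that has at least one missing value."""
    complete = ~np.isnan(X).any(axis=1)
    return X[complete], y[complete]

def replace_with_mean(X):
    """Replacement: fill each missing entry with the observed mean of its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def impute_by_regression(X):
    """Regression imputation (assumed here to be linear least squares):
    predict each missing entry from the other columns, which are
    mean-filled first so that the design matrix has no gaps."""
    filled = replace_with_mean(X)
    out = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if not missing.any() or missing.all():
            continue  # nothing to impute, or nothing to fit on
        others = np.delete(filled, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        # Fit on records where column j is observed, then predict the rest.
        coef, *_ = np.linalg.lstsq(design[~missing], X[~missing, j], rcond=None)
        out[missing, j] = design[missing] @ coef
    return out

# Hypothetical toy data: 500 records, collinear predictors, a nonlinear outcome.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
p = 1.0 / (1.0 + np.exp(-(x1 + x2**2 - 1.0)))  # quadratic term as a nonlinearity
y = (rng.random(n) < p).astype(int)

X_miss = make_mcar(X, rate=0.10, rng=rng)
X_omit, y_omit = omit_incomplete(X_miss, y)
X_mean = replace_with_mean(X_miss)
X_reg = impute_by_regression(X_miss)
```

Omission shrinks the training set, while the two filling strategies keep all 500 records; any of the three outputs could then be passed to a classifier and scored on an independent, complete test set, as the abstract describes.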
Pages: 115 - 140
Number of pages: 26
Related papers
50 in total
  • [21] CLASSIFICATION OF URBAN AERIAL DATA BASED ON PIXEL LABELLING WITH DEEP CONVOLUTIONAL NEURAL NETWORKS AND LOGISTIC REGRESSION
    Yao, W.
    Polewski, P.
    Krzystek, P.
    XXIII ISPRS CONGRESS, COMMISSION VII, 2016, 41 (B7) : 405 - 410
  • [22] CONDITIONAL LOGISTIC-REGRESSION WITH MISSING DATA
    GIBBONS, LE
    HOSMER, DW
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 1991, 20 (01) : 109 - 120
  • [23] Reconstructing missing daily precipitation data using regression trees and artificial neural networks for SWAT streamflow simulation
    Kim, Jung-Woo
    Pachepsky, Yakov A.
    JOURNAL OF HYDROLOGY, 2010, 394 (3-4) : 305 - 314
  • [24] Relative performance of artificial neural networks and regression models in predicting missing water quality data
    Tyagi, Punam
    Chandramouli, V.
    Lingireddy, Srinivasa
    Buddhi, D.
    ENVIRONMENTAL ENGINEERING SCIENCE, 2008, 25 (05) : 657 - 668
  • [25] Neural networks, logistic regression, and calibration
    Steyerberg, EW
    MEDICAL DECISION MAKING, 1998, 18 (03) : 349 - 350
  • [26] Classification Methods Based on Fitting Logistic Regression to Positive and Unlabeled Data
    Furmanczyk, Konrad
    Paczutkowski, Kacper
    Dudzinski, Marcin
    Dziewa-Dawidczyk, Diana
    COMPUTATIONAL SCIENCE - ICCS 2022, PT I, 2022 : 31 - 45
  • [27] Predictive Data Analytics using Logistic Regression for Licensure Examination Performance
    Juanatas, Irish C.
    Juanatas, Roben A.
    PROCEEDINGS OF 2019 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND KNOWLEDGE ECONOMY (ICCIKE' 2019), 2019 : 251 - 255
  • [28] Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests
    Maroco J.
    Silva D.
    Rodrigues A.
    Guerreiro M.
    Santana I.
    De Mendonça A.
    BMC Research Notes, 4 (1)
  • [29] Classification of Online Toxic Comments Using the Logistic Regression and Neural Networks Models
    Saif, Mujahed A.
    Medvedev, Alexander N.
    Medvedev, Maxim A.
    Atanasova, Todorka
    PROCEEDINGS OF THE 44TH INTERNATIONAL CONFERENCE "APPLICATIONS OF MATHEMATICS IN ENGINEERING AND ECONOMICS", 2018, 2048
  • [30] Mapping wetlands using ASTER data: a comparison between classification trees and logistic regression
    Pantaleoni, E.
    Wynne, R. H.
    Galbraith, J. M.
    Campbell, J. B.
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 2009, 30 (13) : 3423 - 3440