A Benchmark for Data Imputation Methods

Cited by: 80
Authors
Jaeger, Sebastian [1]
Allhorn, Arndt [1]
Biessmann, Felix [1]
Affiliations
[1] Beuth Univ Appl Sci, Berlin, Germany
Source
FRONTIERS IN BIG DATA | 2021 / Volume 4
Keywords
data quality; data cleaning; imputation; missing data; benchmark; MCAR; MNAR; MAR; ERRORS
DOI
10.3389/fdata.2021.693674
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
With the increasing importance and complexity of data pipelines, data quality became one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). Also, for machine learning (ML) applications, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only test or train and test data are affected by missing data. Each imputation method is evaluated regarding the imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers to guide their data preprocessing method selection for automated data quality improvement.
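The evaluation idea described in the abstract (hide known values under a chosen missingness pattern, impute them, then score the imputed values against the held-back truth) can be illustrated with a minimal sketch. This is not the authors' benchmark code: the function names, the synthetic Gaussian data, the 30% missingness fraction, and the choice of mean imputation as the baseline are all assumptions made for illustration only.

```python
import numpy as np


def make_mcar(X, fraction, rng):
    """Hide `fraction` of all cells uniformly at random (MCAR)."""
    X_missing = X.astype(float).copy()
    mask = rng.random(X.shape) < fraction
    X_missing[mask] = np.nan
    return X_missing, mask


def make_mnar(X, column, fraction):
    """Hide the largest values of one column, so that missingness
    depends on the hidden value itself (MNAR)."""
    X_missing = X.astype(float).copy()
    threshold = np.quantile(X[:, column], 1.0 - fraction)
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, column] = X[:, column] > threshold
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def imputation_rmse(X_true, X_imputed, mask):
    """Root mean squared error, computed on the hidden cells only."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in for a real dataset (assumption, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

X_mcar, mask_mcar = make_mcar(X, 0.3, rng)
X_mnar, mask_mnar = make_mnar(X, column=0, fraction=0.3)

rmse_mcar = imputation_rmse(X, mean_impute(X_mcar), mask_mcar)
rmse_mnar = imputation_rmse(X, mean_impute(X_mnar), mask_mnar)

print(f"mean imputation RMSE under MCAR: {rmse_mcar:.3f}")
print(f"mean imputation RMSE under MNAR: {rmse_mnar:.3f}")
```

In this toy setup the MNAR error comes out larger than the MCAR error, because the observed values no longer represent the hidden ones. The sketch covers imputation quality only; measuring the impact on a downstream ML task, as the abstract describes, would additionally require training and scoring a model on the imputed data.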
Pages: 16
Related papers
50 in total
  • [1] Imputation in well log data: A benchmark for machine learning methods
    Gama, Pedro H. T.
    Faria, Jackson
    Sena, Jessica
    Neves, Francisco
    Riffel, Vinicius R.
    Perez, Lucas
    Korenchendler, Andre
    Sobreira, Matheus C. A.
    Machado, Alexei M. C.
    COMPUTERS & GEOSCIENCES, 2025, 196
  • [2] Missing Data and Imputation Methods
    Schober, Patrick
    Vetter, Thomas R.
    ANESTHESIA AND ANALGESIA, 2020, 131 (05): : 1419 - 1420
  • [3] Imputation Methods for Incomplete Data
    Umathe, Vaishali H.
    Chaudhary, Gauri
    2015 INTERNATIONAL CONFERENCE ON INNOVATIONS IN INFORMATION, EMBEDDED AND COMMUNICATION SYSTEMS (ICIIECS), 2015,
  • [4] Imputation of data Missing Not at Random: Artificial generation and benchmark analysis
    Pereira, Ricardo Cardoso
    Abreu, Pedro Henriques
    Rodrigues, Pedro Pereira
    Figueiredo, Mario A. T.
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 249
  • [5] Alternative imputation methods for wage data
    1600, Publ by American Statistical Assoc, Alexandria, VA, USA
  • [6] Imputation Methods for scRNA Sequencing Data
    Wang, Mengyuan
    Gan, Jiatao
    Han, Changfeng
    Guo, Yanbing
    Chen, Kaihao
    Shi, Ya-zhou
    Zhang, Ben-gong
    APPLIED SCIENCES-BASEL, 2022, 12 (20):
  • [7] ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data
    Khayati, Mourad
    Nater, Quentin
    Pasquier, Jacques
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (12): : 4329 - 4332
  • [8] Deep Learning Methods for Omics Data Imputation
    Huang, Lei
    Song, Meng
    Shen, Hui
    Hong, Huixiao
    Gong, Ping
    Deng, Hong-Wen
    Zhang, Chaoyang
    BIOLOGY-BASEL, 2023, 12 (10):
  • [9] Comparison of alternative imputation methods for ordinal data
    Cugnata, Federica
    Salini, Silvia
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2017, 46 (01) : 315 - 330
  • [10] Imputation of missing longitudinal data: a comparison of methods
    Engels, JM
    Diehr, P
    JOURNAL OF CLINICAL EPIDEMIOLOGY, 2003, 56 (10) : 968 - 976