Ultra Data-Oriented Parallel Fractional Hot-Deck Imputation With Efficient Linearized Variance Estimation

Cited by: 1
Authors
Yang, Yicheng [1]
Kwon, Yonghyun [2]
Kim, Jae Kwang [2]
Cho, In Ho [1]
Affiliations
[1] Iowa State Univ, Dept Civil Engn, Ames, IA 50011 USA
[2] Iowa State Univ, Dept Stat, Ames, IA 50011 USA
Funding
National Science Foundation (USA)
Keywords
Deep learning; parallel linearized variance estimation; two-staged feature selection; ultra data-oriented parallel fractional hot-deck imputation; ultra incomplete data; ultrahigh dimensional missing data curing; MULTIPLE IMPUTATION; INCOMPLETE DATA; SELECTION; LIKELIHOOD; REGRESSION; MODELS;
DOI
10.1109/TKDE.2023.3249567
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Parallel fractional hot-deck imputation (P-FHDI (Yang et al. 2020)) is a general-purpose, assumption-free tool for handling item nonresponse in big incomplete data; it combines the theory of FHDI with parallel computing. FHDI cures multivariate missing data by filling each missing unit with multiple observed values (hence "hot-deck") without resorting to distributional assumptions. P-FHDI can handle big incomplete data with millions of instances (big-n) or 10,000 variables (big-p). However, ultra incomplete data (i.e., concurrently big-n and big-p), which combine a very large number of instances with high dimensionality, remain a problem for P-FHDI because of excessive memory requirements and execution time. To address these challenges, we propose the ultra data-oriented P-FHDI (named UP-FHDI), which is capable of curing ultra incomplete data. In addition to the parallel Jackknife method, this paper enables computationally efficient ultra data-oriented variance estimation using parallel linearization techniques. Results confirm that UP-FHDI can handle an ultra dataset with one million instances and 10,000 variables. This paper illustrates the special parallel algorithms of UP-FHDI and confirms its positive impact on subsequent deep learning performance.
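The core idea in the abstract, filling each missing unit with several observed values, can be illustrated with a toy single-variable sketch. This is not the authors' P-FHDI or UP-FHDI algorithm: the function names, the pre-assigned imputation cells, the random donor selection, and the equal fractional weights 1/k are all simplifying assumptions made for illustration only.

```python
import random
from collections import defaultdict


def fractional_hot_deck(cells, values, m=2, seed=0):
    """Toy fractional hot-deck imputation for one variable (illustrative only).

    cells[i]  : imputation-cell label of unit i (assumed fully observed)
    values[i] : observed value, or None when the item is missing
    Returns rows of (unit index, value, fractional weight).
    """
    rng = random.Random(seed)

    # Collect the observed values (donors) available in each cell.
    donors = defaultdict(list)
    for c, v in zip(cells, values):
        if v is not None:
            donors[c].append(v)

    rows = []
    for i, (c, v) in enumerate(zip(cells, values)):
        if v is not None:
            rows.append((i, v, 1.0))  # respondents keep weight 1
            continue
        pool = donors[c]
        if not pool:
            raise ValueError(f"no donor available in cell {c!r}")
        k = min(m, len(pool))
        # Assumption: up to m donors drawn without replacement,
        # each carrying an equal fractional weight 1/k.
        for d in rng.sample(pool, k):
            rows.append((i, d, 1.0 / k))
    return rows


def weighted_mean(rows):
    """Mean of the fractionally imputed data set."""
    total_weight = sum(w for _, _, w in rows)
    return sum(v * w for _, v, w in rows) / total_weight
```

For example, `fractional_hot_deck(["a", "a", "a", "b", "b"], [1.0, 3.0, None, 10.0, None])` imputes unit 2 with both donors from cell "a" at weight 0.5 each, so every unit still contributes a total weight of 1 to downstream estimates.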
Pages: 9754-9768
Page count: 15
Related papers
14 in total
  • [1] Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing
    Yang, Yicheng
    Kim, Jae Kwang
    Cho, In Ho
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (08) : 3912 - 3926
  • [2] Impacts of Fractional Hot-Deck Imputation on Learning and Prediction of Engineering Data
    Song, Ikkyun
    Yang, Yicheng
    Im, Jongho
    Tong, Tong
    Ceylan, Halil
    Cho, In Ho
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (12) : 2363 - 2373
  • [3] Jackknife variance estimation for multivariate statistics under hot-deck imputation from common donors
    Skinner, CJ
    Rao, JNK
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2002, 102 (01) : 149 - 167
  • [4] FINDING A FLEXIBLE HOT-DECK IMPUTATION METHOD FOR MULTINOMIAL DATA
    Andridge, Rebecca
    Bechtel, Laura
    Thompson, Katherine Jenny
    JOURNAL OF SURVEY STATISTICS AND METHODOLOGY, 2021, 9 (04) : 789 - 809
  • [5] Using the Fractional Imputation Methodology to Evaluate Variance due to Hot Deck Imputation in Survey Data
    Perez, Adriana
    JOURNAL OF MODERN APPLIED STATISTICAL METHODS, 2007, 6 (01) : 248 - 257
  • [6] A global Water Quality Index and hot-deck imputation of missing data
    Srebotnjak, Tanja
    Carr, Genevieve
    de Sherbinin, Alexander
    Rickwood, Carrie
    ECOLOGICAL INDICATORS, 2012, 17 : 108 - 119
  • [7] JACKKNIFE VARIANCE-ESTIMATION WITH SURVEY DATA UNDER HOT DECK IMPUTATION
    RAO, JNK
    SHAO, J
    BIOMETRIKA, 1992, 79 (04) : 811 - 822
  • [8] Bootstrap methods for imputed data from regression, ratio and hot-deck imputation
    Mashreghi, Zeinab
    Leger, Christian
    Haziza, David
    CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2014, 42 (01) : 142 - 167
  • [9] A hot-deck multiple imputation procedure for gaps in longitudinal data on recurrent events
    Little, Roderick J.
    Yosef, Matheos
    Cain, Kevin C.
    Nan, Bin
    Harlow, Sioban D.
    STATISTICS IN MEDICINE, 2008, 27 (01) : 103 - 120
  • [10] Multiple hot-deck imputation for network inference from RNA sequencing data
    Imbert, Alyssa
    Valsesia, Armand
    Le Gall, Caroline
    Armenise, Claudia
    Lefebvre, Gregory
    Gourraud, Pierre-Antoine
    Viguerie, Nathalie
    Villa-Vialaneix, Nathalie
    BIOINFORMATICS, 2018, 34 (10) : 1726 - 1732