Ultra Data-Oriented Parallel Fractional Hot-Deck Imputation With Efficient Linearized Variance Estimation

Cited by: 1

Authors:

Yang, Yicheng [1]

Kwon, Yonghyun [2]

Kim, Jae Kwang [2]

Cho, In Ho [1]

Affiliations:

[1] Iowa State Univ, Dept Civil Engn, Ames, IA 50011 USA

[2] Iowa State Univ, Dept Stat, Ames, IA 50011 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2023 / Vol. 35 / No. 09

Funding:

U.S. National Science Foundation;

Keywords:

Deep learning; parallel linearized variance estimation; two-staged feature selection; ultra data-oriented parallel fractional hot-deck imputation; ultra incomplete data; ultrahigh dimensional missing data curing; MULTIPLE IMPUTATION; INCOMPLETE DATA; SELECTION; LIKELIHOOD; REGRESSION; MODELS;

DOI:

10.1109/TKDE.2023.3249567

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Parallel fractional hot-deck imputation (P-FHDI; Yang et al. 2020) is a general-purpose, assumption-free tool for handling item nonresponse in big incomplete data by combining the theory of FHDI and parallel computing. FHDI cures multivariate missing data by filling each missing unit with multiple observed values (thus, hot-deck) without resorting to distributional assumptions. P-FHDI can tackle big incomplete data with millions of instances (big-n) or 10,000 variables (big-p). However, handling ultra incomplete data (i.e., concurrently big-n and big-p), with a very large number of instances and high dimensionality, poses problems for P-FHDI because of excessive memory requirements and execution time. To tackle these challenges, we propose the ultra data-oriented P-FHDI (named UP-FHDI), capable of curing ultra incomplete data. In addition to the parallel Jackknife method, this paper enables a computationally efficient ultra data-oriented variance estimation using parallel linearization techniques. Results confirm that UP-FHDI can tackle an ultra dataset with one million instances and 10,000 variables. This paper illustrates the special parallel algorithms of UP-FHDI and confirms its positive impact on the subsequent deep learning performance.
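
The idea of filling each missing unit with multiple observed values can be illustrated with a minimal serial sketch. This is not the UP-FHDI algorithm from the paper: it assumes a single variable, pre-defined imputation cells, and equal fractional weights over all observed donors in the same cell, and it omits parallelism, feature selection, and variance estimation. The function names `fractional_hot_deck` and `fhdi_mean` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def fractional_hot_deck(cells, y):
    """Toy fractional hot-deck imputation (illustrative assumption, not UP-FHDI).

    cells: imputation-cell label for each unit.
    y:     observed value for each unit, or None if missing.
    Returns rows of (unit_index, value, fractional_weight).
    """
    # Collect observed values (donors) per imputation cell.
    donors = defaultdict(list)
    for cell, value in zip(cells, y):
        if value is not None:
            donors[cell].append(value)

    rows = []
    for i, (cell, value) in enumerate(zip(cells, y)):
        if value is not None:
            # Observed unit: keep its own value with weight 1.
            rows.append((i, value, 1.0))
            continue
        pool = donors.get(cell, [])
        if not pool:
            raise ValueError(f"no donors available in cell {cell!r}")
        # Missing unit: one row per donor, weights summing to 1.
        for donor_value in pool:
            rows.append((i, donor_value, 1.0 / len(pool)))
    return rows


def fhdi_mean(rows, n):
    """Weighted mean over the fractionally imputed rows for n units."""
    return sum(value * weight for _, value, weight in rows) / n


if __name__ == "__main__":
    cells = ["a", "a", "a", "b", "b"]
    y = [1.0, 3.0, None, 10.0, None]
    rows = fractional_hot_deck(cells, y)
    print(rows)
    print(fhdi_mean(rows, len(y)))
```

Each missing unit contributes several weighted rows rather than one imputed value, which is what distinguishes fractional imputation from single hot-deck imputation in this simplified setting.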

References

Pages: 9754 - 9768

Page count: 15

14 entries in total

[1] Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing
Yang, Yicheng
Kim, Jae Kwang
Cho, In Ho
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (08) : 3912 - 3926
[2] Impacts of Fractional Hot-Deck Imputation on Learning and Prediction of Engineering Data
Song, Ikkyun
Yang, Yicheng
Im, Jongho
Tong, Tong
Ceylan, Halil
Cho, In Ho
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (12) : 2363 - 2373
[3] Jackknife variance estimation for multivariate statistics under hot-deck imputation from common donors
Skinner, CJ
Rao, JNK
JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2002, 102 (01) : 149 - 167
[4] Finding a Flexible Hot-Deck Imputation Method for Multinomial Data
Andridge, Rebecca
Bechtel, Laura
Thompson, Katherine Jenny
JOURNAL OF SURVEY STATISTICS AND METHODOLOGY, 2021, 9 (04) : 789 - 809
[5] Using the Fractional Imputation Methodology to Evaluate Variance due to Hot Deck Imputation in Survey Data
Perez, Adriana
JOURNAL OF MODERN APPLIED STATISTICAL METHODS, 2007, 6 (01) : 248 - 257
[6] A global Water Quality Index and hot-deck imputation of missing data
Srebotnjak, Tanja
Carr, Genevieve
de Sherbinin, Alexander
Rickwood, Carrie
ECOLOGICAL INDICATORS, 2012, 17 : 108 - 119
[7] Jackknife Variance-Estimation with Survey Data under Hot Deck Imputation
Rao, JNK
Shao, J
BIOMETRIKA, 1992, 79 (04) : 811 - 822
[8] Bootstrap methods for imputed data from regression, ratio and hot-deck imputation
Mashreghi, Zeinab
Leger, Christian
Haziza, David
CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2014, 42 (01) : 142 - 167
[9] A hot-deck multiple imputation procedure for gaps in longitudinal data on recurrent events
Little, Roderick J.
Yosef, Matheos
Cain, Kevin C.
Nan, Bin
Harlow, Sioban D.
STATISTICS IN MEDICINE, 2008, 27 (01) : 103 - 120
[10] Multiple hot-deck imputation for network inference from RNA sequencing data
Imbert, Alyssa
Valsesia, Armand
Le Gall, Caroline
Armenise, Claudia
Lefebvre, Gregory
Gourraud, Pierre-Antoine
Viguerie, Nathalie
Villa-Vialaneix, Nathalie
BIOINFORMATICS, 2018, 34 (10) : 1726 - 1732
