Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach

Cited by: 37
Authors
Rodriguez, Rafael [1]
Pastorini, Marcos [2]
Etcheverry, Lorena [2]
Chreties, Christian [1]
Fossati, Monica [1]
Castro, Alberto [2]
Gorgoglione, Angela [1]
Affiliations
[1] Univ Republica, Fac Ingn, Inst Mecan Fluidos & Ingn Ambiental IMFIA, Montevideo 11300, Uruguay
[2] Univ Republica, Fac Ingn, Inst Computac InCo, Montevideo 11300, Uruguay
Keywords
data scarcity; water quality; missing data; univariate imputation; multivariate imputation; machine learning; hydroinformatics; PRECIPITATION RECORDS; TEMPERATURE; ACCURACY; IMPROVE; RUNOFF; RIVER; IDW;
DOI
10.3390/su13116318
Chinese Library Classification
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Monitoring surface-water quality, followed by water-quality modeling and analysis, is essential for generating effective strategies in surface-water-resource management. However, worldwide, and particularly in developing countries, water-quality studies are limited by the lack of complete and reliable datasets of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Lucia Chico River (Uruguay), a mixed lotic and lentic river system. The main challenges of this study are the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability that characterizes the water-quality variables. The competing algorithms implement univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR). According to the results, more than 76% of the imputation outcomes are considered "satisfactory" (Nash-Sutcliffe efficiency, NSE > 0.45). Imputation performance is better at the monitoring stations located inside the reservoir than at those positioned along the mainstream. IDW was the model with the best imputation results, followed by RFR, HR, and SVR. The approach proposed in this study is expected to help water-resource researchers and managers augment water-quality datasets and overcome the missing-data issue, thereby increasing the number of future water-quality studies.
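The abstract names inverse distance weighting (IDW) as the best-performing imputer and uses NSE > 0.45 as the "satisfactory" threshold. The sketch below illustrates both ideas in a generic form; it is not the authors' implementation. It assumes a spatial formulation of IDW in which a missing value at one station is filled from the other stations' values at the same time step. The function names `nse` and `idw_impute`, the array layout, the station coordinates, and the exponent `power=2.0` are all hypothetical choices made for illustration.

```python
import numpy as np


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of
    the observations from their mean. Pairs containing NaN are ignored."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    mask = ~np.isnan(observed) & ~np.isnan(simulated)
    obs, sim = observed[mask], simulated[mask]
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def idw_impute(values, coords, target, power=2.0):
    """Fill NaNs at station `target` with an inverse-distance-weighted
    average of the other stations' values at the same time step.

    values: (n_times, n_stations) array, NaN marks a missing sample
    coords: (n_stations, 2) array of station coordinates (assumed layout)
    """
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    others = [j for j in range(values.shape[1]) if j != target]

    # Weights depend only on geometry, so compute them once.
    dist = np.linalg.norm(coords[others] - coords[target], axis=1)
    weights = 1.0 / dist ** power

    filled = values[:, target].copy()
    for t in np.flatnonzero(np.isnan(filled)):
        neigh = values[t, others]
        ok = ~np.isnan(neigh)
        if ok.any():  # leave the gap if no neighbor has data at time t
            filled[t] = np.sum(weights[ok] * neigh[ok]) / np.sum(weights[ok])
    return filled


# Toy data: three stations on a line, station 1 is the imputation target.
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
values = [
    [1.0, np.nan, 3.0],
    [2.0, 2.0, 2.0],
    [np.nan, np.nan, 4.0],
]
filled = idw_impute(values, coords, target=1)
```

A typical evaluation would hide a portion of the known samples, impute them, and compare the imputed and true values with `nse`; under the threshold quoted in the abstract, a score above 0.45 would count as satisfactory.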
Pages: 17