Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach

Cited by: 36
Authors
Rodriguez, Rafael [1]
Pastorini, Marcos [2]
Etcheverry, Lorena [2]
Chreties, Christian [1]
Fossati, Monica [1]
Castro, Alberto [2]
Gorgoglione, Angela [1]
Affiliations
[1] Univ Republica, Fac Ingn, Inst Mecan Fluidos & Ingn Ambiental IMFIA, Montevideo 11300, Uruguay
[2] Univ Republica, Fac Ingn, Inst Computac InCo, Montevideo 11300, Uruguay
Keywords
data scarcity; water quality; missing data; univariate imputation; multivariate imputation; machine learning; hydroinformatics; PRECIPITATION RECORDS; TEMPERATURE; ACCURACY; IMPROVE; RUNOFF; RIVER; IDW;
DOI
10.3390/su13116318
Chinese Library Classification
X [Environmental Science, Safety Science];
Discipline classification code
08 ; 0830 ;
Abstract
The monitoring of surface-water quality, followed by water-quality modeling and analysis, is essential for generating effective strategies in surface-water-resource management. However, worldwide, and particularly in developing countries, water-quality studies are limited by the lack of complete and reliable datasets of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Lucia Chico River (Uruguay), a mixed lotic and lentic river system. The challenge of this study lies in the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability that characterizes the water-quality variables. The competing algorithms implement univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR) and K-nearest neighbors Regressor (KNNR). According to the results, more than 76% of the imputation outcomes are considered "satisfactory" (NSE > 0.45). Imputation performance is better at the monitoring stations located inside the reservoir than at those positioned along the mainstream. IDW was the model with the best imputation results, followed by RFR, HR and SVR. The approach proposed in this study is expected to help water-resource researchers and managers augment water-quality datasets and overcome the missing-data issue, thereby increasing the number of future water-quality studies.
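As an illustration of two ideas named in the abstract, the sketch below computes the Nash-Sutcliffe efficiency (NSE) in its standard form and fills a missing value by inverse distance weighting. It is not the authors' implementation: the function names `nse` and `idw_impute`, the use of station-to-station distances, and the exponent `power=2.0` are assumptions made here for illustration, since the abstract does not state how IDW was configured.

```python
import numpy as np


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / spread


def idw_impute(values, distances, power=2.0):
    """Estimate a missing value from neighbouring observations.

    values    : observations at neighbouring stations (NaN where missing)
    distances : strictly positive distances from each neighbour to the target
    power     : distance exponent (assumed here; not given in the abstract)
    """
    values = np.asarray(values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    known = ~np.isnan(values)
    if not known.any():
        return float("nan")  # nothing to interpolate from
    weights = 1.0 / distances[known] ** power
    return float(np.sum(weights * values[known]) / np.sum(weights))
```

Under the abstract's criterion, an imputation would count as "satisfactory" when `nse(observed, imputed) > 0.45` on values that were deliberately held out.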
Pages: 17