Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach

Cited by: 36
Authors
Rodriguez, Rafael [1]
Pastorini, Marcos [2]
Etcheverry, Lorena [2]
Chreties, Christian [1]
Fossati, Monica [1]
Castro, Alberto [2]
Gorgoglione, Angela [1]
Affiliations
[1] Univ Republica, Fac Ingn, Inst Mecan Fluidos & Ingn Ambiental IMFIA, Montevideo 11300, Uruguay
[2] Univ Republica, Fac Ingn, Inst Computac InCo, Montevideo 11300, Uruguay
Keywords
data scarcity; water quality; missing data; univariate imputation; multivariate imputation; machine learning; hydroinformatics; PRECIPITATION RECORDS; TEMPERATURE; ACCURACY; IMPROVE; RUNOFF; RIVER; IDW;
DOI
10.3390/su13116318
Chinese Library Classification number
X [Environmental Science, Safety Science];
Discipline classification code
08 ; 0830 ;
Abstract
Monitoring surface-water quality, followed by water-quality modeling and analysis, is essential for generating effective strategies in surface-water-resource management. However, worldwide, and particularly in developing countries, water-quality studies are limited by the lack of complete and reliable datasets of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Lucia Chico River (Uruguay), a mixed lotic and lentic river system. The main challenges of this study are the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability that characterizes the water-quality variables. The competing algorithms implement univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR) and K-nearest neighbors Regressor (KNNR). According to the results, more than 76% of the imputation outcomes are considered "satisfactory" (Nash-Sutcliffe efficiency, NSE > 0.45). Imputation performance is better at the monitoring stations located inside the reservoir than at those positioned along the mainstream. IDW was the model with the best imputation results, followed by RFR, HR and SVR. The approach proposed in this study is expected to help water-resource researchers and managers augment water-quality datasets and overcome the missing-data issue, thereby increasing the number of future water-quality studies.
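As a rough illustration of two ideas named in the abstract, the sketch below computes the Nash-Sutcliffe efficiency (NSE) and fills one missing value by inverse distance weighting. It is not the authors' implementation: the function names `nse` and `idw_impute`, the use of planar station coordinates as the distance measure, and the exponent `power=2.0` are all assumptions made for this example.

```python
from math import hypot


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations from the observed mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def idw_impute(target_xy, neighbors, power=2.0):
    """Estimate a missing value at target_xy from neighboring stations.

    neighbors is a list of ((x, y), value) pairs; pairs whose value is None
    (also missing) are skipped. Coordinates and power are assumptions of
    this sketch, not settings taken from the paper.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in neighbors:
        if value is None:
            continue
        distance = hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0.0:
            return value  # a co-located station supplies the value directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    # None signals that no neighbor had a usable value
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical example: two neighbors at distances 1 and 2 from the target
estimate = idw_impute((0.0, 0.0), [((1.0, 0.0), 10.0), ((2.0, 0.0), 20.0)])
```

An NSE of 1 indicates a perfect match with the observations, and 0 means the estimates are no better than the observed mean; the abstract's "satisfactory" threshold sits at NSE > 0.45.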
Pages: 17
Related papers
50 in total
  • [31] A Probabilistic Approach for Missing Data Imputation
    Arefin, Muhammed Nazmul
    Masum, Abdul Kadar Muhammad
    COMPLEXITY, 2024, 2024
  • [32] Evaluating Machine Learning Classification Using Sorted Missing Percentage Technique Based on Missing Data
    Hung, Che-Yu
    Jiang, Bernard C.
    Wang, Chien-Chih
    APPLIED SCIENCES-BASEL, 2020, 10 (14):
  • [33] Imputation of missing gas permeability data for polymer membranes using machine learning
    Yuan, Qi
    Longo, Mariagiulia
    Thornton, Aaron W.
    McKeown, Neil B.
    Comesana-Gandara, Bibiana
    Jansen, Johannes C.
    Jelfs, Kim E.
    JOURNAL OF MEMBRANE SCIENCE, 2021, 627
  • [34] Handling high-dimensional data with missing values by modern machine learning techniques
    Chen, Sixia
    Xu, Chao
    JOURNAL OF APPLIED STATISTICS, 2023, 50 (03) : 786 - 804
  • [35] Application of machine learning methods in the imputation of heterogeneous co-missing data
    So, Hon Yiu
    Ma, Jinhui
    Griffith, Lauren E.
    Balakrishnan, Narayanaswamy
    JAPANESE JOURNAL OF STATISTICS AND DATA SCIENCE, 2025,
  • [36] Variable selection with missing data in both covariates and outcomes: Imputation and machine learning
    Hu, Liangyuan
    Lin, Jung-Yi Joyce
    Ji, Jiayi
    STATISTICAL METHODS IN MEDICAL RESEARCH, 2021, 30 (12) : 2651 - 2671
  • [37] A Classifier Ensemble Machine Learning Approach to Improve Efficiency for Missing Value Imputation
    Chhabra, Geeta
    Vashisht, Vasudha
    Ranjan, Jayanthi
    2018 INTERNATIONAL CONFERENCE ON COMPUTING, POWER AND COMMUNICATION TECHNOLOGIES (GUCON), 2018, : 23 - 27
  • [38] Prediction of concrete strengths enabled by missing data imputation and interpretable machine learning
    Lyngdoh, Gideon A.
    Zaki, Mohd
    Krishnan, N. M. Anoop
    Das, Sumanta
    CEMENT & CONCRETE COMPOSITES, 2022, 128
  • [39] Active learning with missing values considering imputation uncertainty
    Han, Jongmin
    Kang, Seokho
    KNOWLEDGE-BASED SYSTEMS, 2021, 224
  • [40] Semi-supervised learning with missing values imputation
    Huang, Buliao
    Zhu, Yunhui
    Usman, Muhammad
    Chen, Huanhuan
    KNOWLEDGE-BASED SYSTEMS, 2024, 284