Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach

Cited by: 36
Authors
Rodriguez, Rafael [1 ]
Pastorini, Marcos [2 ]
Etcheverry, Lorena [2 ]
Chreties, Christian [1 ]
Fossati, Monica [1 ]
Castro, Alberto [2 ]
Gorgoglione, Angela [1 ]
Affiliations
[1] Univ Republica, Fac Ingn, Inst Mecan Fluidos & Ingn Ambiental IMFIA, Montevideo 11300, Uruguay
[2] Univ Republica, Fac Ingn, Inst Computac InCo, Montevideo 11300, Uruguay
Keywords
data scarcity; water quality; missing data; univariate imputation; multivariate imputation; machine learning; hydroinformatics; PRECIPITATION RECORDS; TEMPERATURE; ACCURACY; IMPROVE; RUNOFF; RIVER; IDW;
DOI
10.3390/su13116318
Chinese Library Classification number
X [Environmental science, safety science];
Discipline classification code
08 ; 0830 ;
Abstract
The monitoring of surface-water quality, followed by water-quality modeling and analysis, is essential for generating effective strategies in surface-water-resource management. However, worldwide, and particularly in developing countries, water-quality studies are limited by the lack of complete and reliable datasets of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Lucia Chico river (Uruguay), a mixed lotic and lentic river system. The main challenges of this study are the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability that characterizes the water-quality variables. The competing algorithms implement univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR) and K-nearest neighbors Regressor (KNNR). According to the results, more than 76% of the imputation outcomes are considered "satisfactory" (Nash-Sutcliffe efficiency, NSE > 0.45). Imputation performance is better at the monitoring stations located inside the reservoir than at those positioned along the mainstream. IDW was the model with the best imputation results, followed by RFR, HR and SVR. The approach proposed in this study is expected to help water-resource researchers and managers augment water-quality datasets and overcome the missing-data issue, thereby increasing the number of future water-quality studies.
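The abstract names two ingredients that are easy to illustrate: inverse distance weighting as an imputation method and the Nash-Sutcliffe efficiency as the acceptance score. The following is a minimal Python sketch of both, not the authors' implementation. The function names `nse` and `idw_impute`, the power parameter of 2, and the idea of weighting neighboring stations by their distance to the target station are assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error
    to the squared deviation of the observations from their mean."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def idw_impute(values, distances, power=2.0):
    """Estimate a missing value at a target station from its neighbors.

    values    -- observations at neighboring stations (NaN where missing)
    distances -- positive distance from each neighbor to the target station
    power     -- distance exponent (2.0 is an assumed, commonly used default)
    """
    values = np.asarray(values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    available = ~np.isnan(values)
    if not available.any():
        return float("nan")  # nothing to interpolate from
    weights = 1.0 / distances[available] ** power
    return float(np.sum(weights * values[available]) / np.sum(weights))


# Hypothetical example: two neighbors at distances 1 and 2 from the target.
estimate = idw_impute([10.0, 20.0], [1.0, 2.0])

# Score a set of imputed values against held-out observations.
score = nse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
is_satisfactory = score > 0.45  # threshold quoted in the abstract
```

In this sketch, closer stations dominate the estimate because the weights fall off with the square of the distance, and an NSE of 1 corresponds to a perfect match while 0 means the imputation is no better than the observed mean.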
Pages: 17