Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation

Cited by: 9
Authors
Chia, Min Yan [1]
Koo, Chai Hoon [1]
Huang, Yuk Feng [1]
Di Chan, Wei [1]
Pang, Jia Yin [1]
Affiliation
[1] Univ Tunku Abdul Rahman, Lee Kong Chian Fac Engn & Sci, Dept Civil Engn, Bandar Sungai Long, Selangor, Malaysia
Keywords
synthetic data; artificial intelligence; back-propagation neural network; water quality index
DOI
10.1007/s11269-023-03650-6
Chinese Library Classification
TU [Building Science]
Subject classification code
0813
Abstract
The water quality index (WQI) is used in many countries and regions as a numeric representation of the condition of water resources. However, computing the WQI involves a host of water quality variables. Although machine learning models have proven to be a promising tool for estimating the WQI with fewer inputs, sufficient data or samples must be collected so that the models can be trained well. This poses a great challenge in places that lack the data collection infrastructure to meet the needs of machine learning models, so data scarcity is a major issue to be tackled. This study covered two major rivers that serve as water intakes in Peninsular Malaysia (Selangor River and Skudai River), where four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. Using the pairwise correlation difference (PCD), the Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test, the best synthetic datasets were selected for the two rivers. CopulaGAN1 and CopulaGAN2 yielded the best small and large synthetic datasets at Selangor River, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, TVAE1 and TVAE2 were chosen. The real and synthetic datasets were used to train a back-propagation neural network (BPNN) for WQI estimation. Based on the various evaluation metrics, increasing the size of the training data with synthetic data had a positive impact on the performance of the BPNN. The BPNN trained with CopulaGAN2 (at Selangor River) and TVAE2 (at Skudai River) yielded more accurate estimates than those derived from the actual and smaller datasets.
Data were insufficient to train machine learning models well in developing regions.
Synthetic data methods can overcome the data scarcity issue in Malaysia.
CopulaGAN and TVAE outperformed the other methods at Selangor River and Skudai River, respectively.
BPNN trained with synthetic datasets estimated WQI with higher accuracy.
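
The abstract names three statistics used to compare a synthetic dataset with the real one: PCD, KLD and the KS statistic. The following is a minimal sketch of how such scores could be computed, not the authors' implementation. It assumes PCD is the Frobenius norm of the difference between the two Pearson correlation matrices, that KLD is estimated from histograms over shared bins with a small smoothing constant, and that the KS statistic is the usual two-sample maximum gap between empirical CDFs. The function names, the `bins` and `eps` parameters, the column names `var_a` and `var_b`, and the sample values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pcd(real, synth):
    """Assumed PCD: Frobenius norm of Corr(real) - Corr(synth).

    `real` and `synth` map column name -> list of floats.
    """
    cols = list(real)
    total = 0.0
    for a in cols:
        for b in cols:
            diff = pearson(real[a], real[b]) - pearson(synth[a], synth[b])
            total += diff ** 2
    return math.sqrt(total)


def kld(x, y, bins=10, eps=1e-9):
    """Histogram estimate of KL(P_x || P_y) over shared, equal-width bins."""
    lo = min(min(x), min(y))
    hi = max(max(x), max(y))
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def hist(values):
        counts = [0] * bins
        for v in values:
            k = min(int((v - lo) / width), bins - 1)
            counts[k] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [c / len(values) + eps for c in counts]

    p, q = hist(x), hist(y)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


# Hypothetical two-column tables standing in for real and synthetic samples.
real = {"var_a": [1.0, 2.0, 3.0, 4.0], "var_b": [2.0, 4.1, 5.9, 8.2]}
synth = {"var_a": [1.2, 1.9, 3.3, 3.8], "var_b": [2.5, 3.7, 6.4, 7.9]}

print("PCD:", pcd(real, synth))
for col in real:
    print(col, "KLD:", kld(real[col], synth[col], bins=4),
          "KS:", ks_statistic(real[col], synth[col]))
```

Under these assumptions, lower values on all three scores indicate a synthetic dataset that is closer to the real one, which matches the selection rule described in the abstract.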
Pages: 6183-6198
Number of pages: 16