Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation

Cited by: 9
Authors
Chia, Min Yan [1]
Koo, Chai Hoon [1]
Huang, Yuk Feng [1]
Chan, Wei Di [1]
Pang, Jia Yin [1]
Affiliations
[1] Univ Tunku Abdul Rahman, Lee Kong Chian Fac Engn & Sci, Dept Civil Engn, Bandar Sungai Long, Selangor, Malaysia
Keywords
synthetic data; artificial intelligence; back-propagation neural network; water quality index
DOI
10.1007/s11269-023-03650-6
Chinese Library Classification number
TU [Building Science]
Discipline classification code
0813
Abstract
The water quality index (WQI) is used in many countries and regions as a numeric representation of the condition of water resources. However, computing the WQI involves a host of water quality variables. Although machine learning models have proven to be a promising tool for estimating the WQI with fewer inputs, sufficient data must be collected so that the models can be trained well. This poses a great challenge in places that lack the data collection infrastructure needed to meet the demands of machine learning models, which makes data scarcity a major issue to be tackled. This study covered two major rivers that served as water intakes in Peninsular Malaysia (the Selangor River and the Skudai River), where four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. Using the pairwise correlation difference (PCD), the Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test, the best synthetic datasets were selected for the two rivers. CopulaGAN1 and CopulaGAN2 yielded the best small and large synthetic datasets for the Selangor River, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, TVAE1 and TVAE2 were chosen. The real and synthetic datasets were used to train a back-propagation neural network (BPNN) for WQI estimation. Based on the various evaluation metrics, increasing the size of the training data with synthetic data had a positive impact on the performance of the BPNN. The BPNN trained with CopulaGAN2 (Selangor River) and TVAE2 (Skudai River) yielded more accurate estimations than those derived from the actual and smaller datasets.
Data in developing regions were insufficient to train machine learning models well.
Synthetic data methods can overcome the data scarcity issue in Malaysia.
CopulaGAN and TVAE outperformed the other methods at the Selangor River and the Skudai River, respectively.
BPNN trained with synthetic datasets estimated WQI with higher accuracy.
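
The three similarity measures named in the abstract (PCD, KLD and the KS statistic) can be illustrated with a minimal sketch. The abstract does not give the exact formulations used in the study, so the choices below are assumptions: PCD is taken as the Frobenius norm of the difference between the Pearson correlation matrices, KLD is computed on shared equal-width histogram bins with a small smoothing constant, and the function names (`pcd`, `ks_statistic`, `kl_divergence`) and the toy values are hypothetical. In all three cases a lower score means the synthetic data sit closer to the real data.

```python
import math
from bisect import bisect_right


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pcd(real_cols, synth_cols):
    """Assumed PCD: Frobenius norm of the gap between correlation matrices.

    Each argument is a list of columns (one list of values per variable).
    """
    k = len(real_cols)
    total = 0.0
    for i in range(k):
        for j in range(k):
            diff = (pearson(real_cols[i], real_cols[j])
                    - pearson(synth_cols[i], synth_cols[j]))
            total += diff ** 2
    return math.sqrt(total)


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    return max(
        abs(bisect_right(r, v) / len(r) - bisect_right(s, v) / len(s))
        for v in r + s
    )


def kl_divergence(real, synth, bins=10, eps=1e-9):
    """Assumed KLD(real || synth) on shared equal-width histogram bins."""
    real, synth = list(real), list(synth)
    lo, hi = min(real + synth), max(real + synth)
    width = (hi - lo) / bins or 1.0  # guard against constant data

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    p, q = hist(real), hist(synth)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    real = [[6.1, 6.8, 7.2, 7.9, 8.4], [2.0, 2.4, 3.1, 3.3, 4.0]]
    synth = [[6.0, 6.9, 7.4, 7.7, 8.6], [2.2, 2.3, 3.0, 3.6, 3.9]]
    print("PCD:", pcd(real, synth))
    print("KS :", ks_statistic(real[0], synth[0]))
    print("KLD:", kl_divergence(real[0], synth[0], bins=5))
```

Under these assumptions, candidate synthetic datasets could be ranked per river by computing the three scores against the real dataset and preferring the candidate with the lowest values, which mirrors the selection criterion stated in the abstract.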
Pages: 6183-6198
Page count: 16
Source: Water Resources Management, 2023, 37: 6183-6198