Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation

Cited by: 9
Authors
Chia, Min Yan [1 ]
Koo, Chai Hoon [1 ]
Huang, Yuk Feng [1 ]
Di Chan, Wei [1 ]
Pang, Jia Yin [1 ]
Affiliations
[1] Univ Tunku Abdul Rahman, Lee Kong Chian Fac Engn & Sci, Dept Civil Engn, Bandar Sungai Long, Selangor, Malaysia
Keywords
synthetic data; artificial intelligence; back-propagation neural network; water quality index
DOI
10.1007/s11269-023-03650-6
Chinese Library Classification number
TU [Building Science]
Discipline classification code
0813
Abstract
The water quality index (WQI) has been utilised in many countries and regions as a numeric representation of the condition of water resources. However, the computation of the WQI involves a host of water quality variables. Although machine learning models have proven to be promising tools for estimating the WQI with fewer inputs, sufficient data or samples must be collected so that the models can be trained well. This poses a great challenge in places that lack the data collection infrastructure needed to meet the requirements of machine learning models, which makes data scarcity a major issue to be tackled. This study covered two major rivers that serve as water intakes in Peninsular Malaysia (Selangor River and Skudai River), where four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. Using the pairwise correlation difference (PCD), the Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test, the best synthetic datasets were selected for the two rivers. CopulaGAN1 and CopulaGAN2 yielded the best small and large synthetic datasets for the Selangor River, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, TVAE1 and TVAE2 were chosen. The real and synthetic datasets were used to train a back-propagation neural network (BPNN) for WQI estimation. Based on the various evaluation metrics, increasing the size of the training data with synthetic data was shown to have a positive impact on the performance of the BPNN. The BPNN trained with CopulaGAN2 (at Selangor River) and TVAE2 (at Skudai River) yielded more accurate estimations than those derived from the actual and smaller datasets.
Data were insufficient to train machine learning models well in developing regions. Synthetic data methods can overcome the data scarcity issue in Malaysia. CopulaGAN and TVAE outperformed the other methods at Selangor River and Skudai River, respectively. BPNN trained with synthetic datasets estimated WQI with higher accuracy.
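The three fidelity checks named in the abstract (PCD, KLD and the KS statistic) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes PCD is the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic tables, and it estimates KLD from shared equal-width histograms. The function names, the bin count and the smoothing constant `eps` are all hypothetical choices made for this example.

```python
import math
from bisect import bisect_right


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined for constant columns


def pcd(real_cols, synth_cols):
    """Assumed PCD: Frobenius norm of the correlation-matrix difference."""
    total = 0.0
    for i in range(len(real_cols)):
        for j in range(len(real_cols)):
            r = pearson(real_cols[i], real_cols[j])
            s = pearson(synth_cols[i], synth_cols[j])
            total += (r - s) ** 2
    return math.sqrt(total)


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


def kld(x, y, bins=10, eps=1e-9):
    """Histogram estimate of KL(P_x || P_y) on shared equal-width bins."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    p, q = hist(x), hist(y)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Lower values of all three scores indicate a synthetic dataset that tracks the real one more closely, which matches the selection rule described in the abstract. In practice each column of the real and synthetic tables would be passed to `ks_statistic` and `kld`, while `pcd` takes the full list of columns.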
Pages: 6183-6198
Page count: 16