BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators

Cited by: 0
Authors
Draghi, Barbara [1]
Wang, Zhenchen [1]
Myles, Puja [1]
Tucker, Allan [2]
Affiliations
[1] Med & Healthcare Prod Regulatory Agcy, London, England
[2] Brunel Univ London, Uxbridge, Middx, England
Keywords
Synthetic data generators; data bias; over-sampling; Bayesian network
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Advanced synthetic data generators can model sensitive personal datasets by creating simulated samples of data with realistic correlation structures and distributions, but with a greatly reduced risk of identifying individuals. This has huge potential in medicine, where sensitive patient data can be simulated and shared, enabling the development and robust validation of new AI technologies for diagnosis and disease management. However, even when massive ground truth datasets are available (such as UK-NHS databases, which contain patient records on the order of millions), there is a high risk that biases still exist and are carried over to the data generators. For example, certain cohorts of patients may be under-represented due to cultural sensitivities amongst some communities, or due to institutionalised procedures in data collection. The under-representation of groups is one of the forms in which bias can manifest itself in machine learning, and it is the one we investigate in this work. These factors may also lead to structurally missing data or incorrect correlations and distributions, which will be mirrored in the synthetic data generated from biased ground truth datasets. In this paper, we explore methods to improve synthetic data generators by using probabilistic methods first to identify the difficult-to-predict data samples in ground truth data, and then to boost these types of data when generating synthetic samples. The paper explores attempts to create synthetic data that contain more realistic distributions and that lead to predictive models with better performance.
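The identify-then-boost idea in the abstract can be illustrated with a rough sketch. This is not the authors' BayesBoost implementation: a categorical naive Bayes model stands in for a Bayesian network, and resampling with replacement stands in for generating synthetic samples. All function names, the 0.5 threshold, the boosting factor, and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model (a very simple Bayesian network)."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(j, label)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values, alpha


def posterior(model, row):
    """Return P(label | row) for every label, using Laplace smoothing."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total
        for j, v in enumerate(row):
            p *= (feature_counts[(j, label)][v] + alpha) / (n + alpha * len(values[j]))
        scores[label] = p
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}


def hard_indices(model, rows, labels, threshold=0.5):
    """Indices of records whose true label gets posterior probability below threshold."""
    return [i for i, (r, y) in enumerate(zip(rows, labels))
            if posterior(model, r)[y] < threshold]


def boost(rows, labels, hard, factor=3, seed=0):
    """Append factor * len(hard) resampled copies of the hard-to-predict records."""
    rng = random.Random(seed)
    extra = [rng.choice(hard) for _ in range(factor * len(hard))] if hard else []
    new_rows = list(rows) + [rows[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_rows, new_labels


# Hypothetical toy data: (group, symptom) -> outcome. Group "B" is
# under-represented and follows the opposite symptom/outcome pattern.
rows = [("A", "s1")] * 45 + [("A", "s0")] * 45 + [("B", "s1")] * 5 + [("B", "s0")] * 5
labels = [1] * 45 + [0] * 45 + [0] * 5 + [1] * 5

model = fit_naive_bayes(rows, labels)
hard = hard_indices(model, rows, labels)
boosted_rows, boosted_labels = boost(rows, labels, hard)
print(len(hard), "hard records;", len(boosted_rows), "records after boosting")
```

In this toy setting the records flagged as hard to predict are the ones from the under-represented group, so boosting raises their share of the training data. A real pipeline would sample new records from a fitted generator rather than duplicate existing ones.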
Pages: 49-62
Page count: 14
Related papers
50 in total
  • [1] Identifying and handling data bias within primary healthcare data using synthetic data generators
    Draghi, Barbara
    Wang, Zhenchen
    Myles, Puja
    Tucker, Allan
    HELIYON, 2024, 10 (02)
  • [2] Identifying User Preferences of Data Handling Using Assisting Technologies
    Offermann, Julia
    Wilkowska, Wiktoria
    Ziefle, Martina
    PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE ON PERVASIVE TECHNOLOGIES RELATED TO ASSISTIVE ENVIRONMENTS, PETRA 2022, 2022, : 129 - 136
  • [3] Deflating Dataset Bias Using Synthetic Data Augmentation
    Jaipuria, Nikita
    Zhang, Xianling
    Bhasin, Rohan
    Arafa, Mayar
    Chakravarty, Punarjay
    Shrivastava, Shubham
    Manglani, Sagar
    Murali, Vidya N.
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 3344 - 3353
  • [4] Synthetic Data Generators - Sequential and Private
    Bousquet, Olivier
    Livni, Roi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [5] Evaluation of Synthetic Data Generators on Complex Tabular Data
    Thees, Oscar
    Novak, Jiri
    Templ, Matthias
    PRIVACY IN STATISTICAL DATABASES, PSD 2024, 2024, 14915 : 194 - 209
  • [6] Identifying biases in deterioration models using synthetic sewer data
    Scheidegger, A.
    Maurer, M.
    WATER SCIENCE AND TECHNOLOGY, 2012, 66 (11) : 2363 - 2369
  • [7] A sampling bias in identifying children in foster care using Medicaid data
    Rubin, DM
    Pati, S
    Luan, XQ
    Alessandrini, EA
    AMBULATORY PEDIATRICS, 2005, 5 (03) : 185 - 190
  • [8] Identifying Bias in Data Using Two-Distribution Hypothesis Tests
    Yik, William
    Serafini, Limnanthes
    Lindsey, Timothy
    Montanez, George D.
    PROCEEDINGS OF THE 2022 AAAI/ACM CONFERENCE ON AI, ETHICS, AND SOCIETY, AIES 2022, 2022, : 831 - 844
  • [9] A sampling bias in identifying children in foster care using Medicaid data
    Rubin, DM
    Pati, S
    Luan, X
    Alessandrini, EA
    PEDIATRIC RESEARCH, 2004, 55 (04) : 192A - 192A
  • [10] Bias On Demand: Investigating Bias with a Synthetic Data Generator
    Baumann, Joachim
    Castelnovo, Alessandro
    Cosentini, Andrea
    Crupi, Riccardo
    Inverardi, Nicole
    Regoli, Daniele
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 7110 - 7114