Dealing with Data Bias in Classification: Can Generated Data Ensure Representation and Fairness?

Cited by: 0
Authors
Duong, Manh Khoi [1]
Conrad, Stefan [1]
Affiliations
[1] Heinrich Heine Univ, Univ Str 1, D-40225 Düsseldorf, Germany
Keywords
fairness; bias; synthetic data; fairness-agnostic; machine learning; optimization
DOI
10.1007/978-3-031-39831-5_17
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fairness is a critical consideration in data analytics and knowledge discovery because biased data can perpetuate inequalities through downstream pipelines. In this paper, we propose a novel pre-processing method to address fairness issues in classification tasks by adding synthetic data points for more representativeness. Our approach utilizes a statistical model to generate new data points, which are evaluated for fairness using discrimination measures. These measures aim to quantify the disparities between demographic groups that may be induced by the bias in data. Our experimental results demonstrate that the proposed method effectively reduces bias for several machine learning classifiers without compromising prediction performance. Moreover, our method outperforms existing pre-processing methods on multiple datasets by Pareto-dominating them in terms of performance and fairness. Our findings suggest that our method can be a valuable tool for data analysts and knowledge discovery practitioners who seek fair, diverse, and representative data.
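The abstract describes a loop of generating candidate points from a statistical model and scoring them with a discrimination measure. A minimal sketch of that idea, assuming statistical parity difference as the discrimination measure and a caller-supplied `sampler` as the generative model (the paper's actual model, measures, and acceptance criterion may differ):

```python
import numpy as np

def statistical_parity_difference(y, group):
    """Absolute gap in positive-label rates between two demographic groups."""
    return abs(y[group == 1].mean() - y[group == 0].mean())

def augment_for_fairness(X, y, group, sampler, n_candidates, rng):
    """Greedily keep sampled points that do not increase the discrimination measure."""
    X_aug, y_aug, g_aug = X.copy(), y.copy(), group.copy()
    for _ in range(n_candidates):
        x_cand, y_cand, g_cand = sampler(rng)  # draw one synthetic (features, label, group)
        before = statistical_parity_difference(y_aug, g_aug)
        after = statistical_parity_difference(
            np.append(y_aug, y_cand), np.append(g_aug, g_cand))
        if after <= before:  # accept only points that keep or reduce the disparity
            X_aug = np.vstack([X_aug, x_cand])
            y_aug = np.append(y_aug, y_cand)
            g_aug = np.append(g_aug, g_cand)
    return X_aug, y_aug, g_aug
```

On a dataset where group 1 has far fewer positive labels than group 0, a sampler biased toward positive group-1 points lets the accepted synthetic data close the gap, after which any classifier is trained on the augmented set as usual.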
Pages: 176-190
Page count: 15
Related Papers
50 records total
  • [21] A robust classification of galaxy spectra: Dealing with noisy and incomplete data
    Connolly, AJ
    Szalay, AS
    ASTRONOMICAL JOURNAL, 1999, 117 (05): : 2052 - 2062
  • [22] Dissimilarity representation on functional spectral data for classification
    Porro-Munoz, Diana
    Talavera, Isneri
    Duin, Robert P. W.
    Hernandez, Noslen
    Orozco-Alzate, Mauricio
    JOURNAL OF CHEMOMETRICS, 2011, 25 (09) : 476 - 486
  • [23] Representation and classification for high-throughput data
    Wessels, LFA
    Reinders, MJT
    van Welsem, T
    Nederlof, PM
    BIOMEDICAL NANOTECHNOLOGY ARCHITECTURES AND APPLICATIONS, 2002, 4626 : 226 - 237
  • [24] Classification of geospatial lattice data and their graphical representation
    Kurihara, K
    CLASSIFICATION, CLUSTERING, AND DATA MINING APPLICATIONS, 2004, : 251 - 258
  • [25] Automated semiconductor wafer defect classification dealing with imbalanced data
    Lee, Po-Hsuan
    Wang, Zhe
    Teh, Cho
    Hsiao, Yi-Sing
    Fang, Wei
    METROLOGY, INSPECTION, AND PROCESS CONTROL FOR MICROLITHOGRAPHY XXXIV, 2020, 11325
  • [26] Bias oriented unbiased data augmentation for cross-bias representation learning
    Li, Lei
    Tang, Fan
    Cao, Juan
    Li, Xirong
    Wang, Danding
    MULTIMEDIA SYSTEMS, 2023, 29 (02) : 725 - 738
  • [28] Can You Fake It Until You Make It?: Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness
    Cheng, Victoria
    Suriyakumar, Vinith M.
    Dullerud, Natalie
    Joshi, Shalmali
    Ghassemi, Marzyeh
    PROCEEDINGS OF THE 2021 ACM CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, FACCT 2021, 2021, : 149 - 160
  • [29] Data Classification by Reducing Bias of Domain-oriented Knowledge Based on Data
    Senda, Masahiro
    Iwasa, Daiji
    Hayashi, Teruaki
    Ohsawa, Yukio
    2019 6TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND INTEGRATED NETWORKS (SPIN), 2019, : 404 - 407
  • [30] Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms
    Liu, Qinyi
    Deho, Oscar
    Vadiee, Farhad
    Khalil, Mohammad
    Joksimovic, Srecko
    Siemens, George
    FIFTEENTH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE, LAK 2025, 2025, : 591 - 600