Dealing with Data Bias in Classification: Can Generated Data Ensure Representation and Fairness?

Cited by: 0
Authors
Duong, Manh Khoi [1]
Conrad, Stefan [1]
Affiliations
[1] Heinrich Heine Univ, Univ Str 1, D-40225 Düsseldorf, Germany
Keywords
fairness; bias; synthetic data; fairness-agnostic; machine learning; optimization
DOI
10.1007/978-3-031-39831-5_17
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fairness is a critical consideration in data analytics and knowledge discovery because biased data can perpetuate inequalities through downstream pipelines. In this paper, we propose a novel pre-processing method to address fairness issues in classification tasks by adding synthetic data points for more representativeness. Our approach utilizes a statistical model to generate new data points, which are evaluated for fairness using discrimination measures. These measures aim to quantify the disparities between demographic groups that may be induced by the bias in data. Our experimental results demonstrate that the proposed method effectively reduces bias for several machine learning classifiers without compromising prediction performance. Moreover, our method outperforms existing pre-processing methods on multiple datasets by Pareto-dominating them in terms of performance and fairness. Our findings suggest that our method can be a valuable tool for data analysts and knowledge discovery practitioners who seek fair, diverse, and representative data.
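The abstract describes a loop of generating candidate points from a statistical model and scoring them with a discrimination measure. A minimal sketch of that idea, assuming statistical parity difference as the discrimination measure and a caller-supplied `sampler` as the generative model (the paper's actual model, measures, and acceptance criterion may differ):

```python
import numpy as np

def statistical_parity_difference(y, group):
    """Absolute gap in positive-label rates between two demographic groups."""
    return abs(y[group == 1].mean() - y[group == 0].mean())

def augment_for_fairness(X, y, group, sampler, n_candidates, rng):
    """Greedily keep sampled points that do not increase the discrimination measure."""
    X_aug, y_aug, g_aug = X.copy(), y.copy(), group.copy()
    for _ in range(n_candidates):
        x_cand, y_cand, g_cand = sampler(rng)  # draw one synthetic (features, label, group)
        before = statistical_parity_difference(y_aug, g_aug)
        after = statistical_parity_difference(
            np.append(y_aug, y_cand), np.append(g_aug, g_cand))
        if after <= before:  # accept only points that keep or reduce the disparity
            X_aug = np.vstack([X_aug, x_cand])
            y_aug = np.append(y_aug, y_cand)
            g_aug = np.append(g_aug, g_cand)
    return X_aug, y_aug, g_aug
```

On a dataset where group 1 has far fewer positive labels than group 0, a sampler biased toward positive group-1 points lets the accepted synthetic data close the gap, after which any classifier is trained on the augmented set as usual.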
Pages: 176-190
Page count: 15
Related Papers
50 records total
  • [21] A robust classification of galaxy spectra: Dealing with noisy and incomplete data
    Connolly, AJ
    Szalay, AS
    ASTRONOMICAL JOURNAL, 1999, 117 (05): : 2052 - 2062
  • [22] Dissimilarity representation on functional spectral data for classification
    Porro-Munoz, Diana
    Talavera, Isneri
    Duin, Robert P. W.
    Hernandez, Noslen
    Orozco-Alzate, Mauricio
    JOURNAL OF CHEMOMETRICS, 2011, 25 (09) : 476 - 486
  • [23] Representation and classification for high-throughput data
    Wessels, LFA
    Reinders, MJT
    van Welsem, T
    Nederlof, PM
    BIOMEDICAL NANOTECHNOLOGY ARCHITECTURES AND APPLICATIONS, 2002, 4626 : 226 - 237
  • [24] Classification of geospatial lattice data and their graphical representation
    Kurihara, K
    CLASSIFICATION, CLUSTERING, AND DATA MINING APPLICATIONS, 2004, : 251 - 258
  • [25] Automated semiconductor wafer defect classification dealing with imbalanced data
    Lee, Po-Hsuan
    Wang, Zhe
    Teh, Cho
    Hsiao, Yi-Sing
    Fang, Wei
    METROLOGY, INSPECTION, AND PROCESS CONTROL FOR MICROLITHOGRAPHY XXXIV, 2020, 11325
  • [26] Bias oriented unbiased data augmentation for cross-bias representation learning
    Li, Lei
    Tang, Fan
    Cao, Juan
    Li, Xirong
    Wang, Danding
    MULTIMEDIA SYSTEMS, 2023, 29 (02) : 725 - 738
  • [28] Can You Fake It Until You Make It?: Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness
    Cheng, Victoria
    Suriyakumar, Vinith M.
    Dullerud, Natalie
    Joshi, Shalmali
    Ghassemi, Marzyeh
    PROCEEDINGS OF THE 2021 ACM CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, FACCT 2021, 2021, : 149 - 160
  • [29] Data Classification by Reducing Bias of Domain-oriented Knowledge Based on Data
    Senda, Masahiro
    Iwasa, Daiji
    Hayashi, Teruaki
    Ohsawa, Yukio
    2019 6TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND INTEGRATED NETWORKS (SPIN), 2019, : 404 - 407
  • [30] Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms
    Liu, Qinyi
    Deho, Oscar
    Vadiee, Farhad
    Khalil, Mohammad
    Joksimovic, Srecko
    Siemens, George
    FIFTEENTH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE, LAK 2025, 2025, : 591 - 600