Dealing with Data Bias in Classification: Can Generated Data Ensure Representation and Fairness?

Cited: 0
Authors
Manh Khoi Duong [1]
Conrad, Stefan [1]
Affiliations
[1] Heinrich Heine Univ, Univ Str 1, D-40225 Dusseldorf, Germany
Keywords
fairness; bias; synthetic data; fairness-agnostic; machine learning; optimization
DOI
10.1007/978-3-031-39831-5_17
CLC classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fairness is a critical consideration in data analytics and knowledge discovery because biased data can perpetuate inequalities through downstream pipelines. In this paper, we propose a novel pre-processing method that addresses fairness issues in classification tasks by adding synthetic data points to make the data more representative. Our approach uses a statistical model to generate new data points, which are evaluated for fairness using discrimination measures. These measures quantify the disparities between demographic groups that may be induced by bias in the data. Our experimental results demonstrate that the proposed method effectively reduces bias for several machine learning classifiers without compromising prediction performance. Moreover, our method outperforms existing pre-processing methods on multiple datasets, Pareto-dominating them in terms of performance and fairness. Our findings suggest that the method can be a valuable tool for data analysts and knowledge discovery practitioners who seek fair, diverse, and representative data.
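The pre-processing idea the abstract describes (generate candidate points from a statistical model, then evaluate them with a discrimination measure) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes statistical parity difference as the discrimination measure and per-group Gaussians as the generative model, and the function names are ours.

```python
import numpy as np

def statistical_parity_difference(y, group):
    """Absolute gap in positive-outcome rates between two demographic groups."""
    return abs(y[group == 0].mean() - y[group == 1].mean())

def debias_by_augmentation(X, y, group, n_new, seed=0):
    """Add up to n_new synthetic points, keeping only candidates that
    lower the dataset's measured disparity (accept/reject loop)."""
    rng = np.random.default_rng(seed)
    # Fit a simple statistical model (Gaussian) per (group, label) stratum.
    models = {}
    for g in (0, 1):
        for lbl in (0, 1):
            Xs = X[(group == g) & (y == lbl)]
            if len(Xs) >= 2:
                models[(g, lbl)] = (Xs.mean(axis=0), np.cov(Xs, rowvar=False))
    added, attempts = 0, 0
    while added < n_new and attempts < 100 * n_new:  # cap to avoid stalling
        attempts += 1
        g, lbl = int(rng.integers(0, 2)), int(rng.integers(0, 2))
        if (g, lbl) not in models:
            continue
        mean, cov = models[(g, lbl)]
        x_new = rng.multivariate_normal(mean, cov)
        cand_y, cand_g = np.append(y, lbl), np.append(group, g)
        # Accept the candidate only if it reduces the discrimination measure.
        if statistical_parity_difference(cand_y, cand_g) < \
                statistical_parity_difference(y, group):
            X = np.vstack([X, x_new])
            y, group = cand_y, cand_g
            added += 1
    return X, y, group
```

On a toy dataset where group 0 receives mostly positive labels and group 1 mostly negative ones, the loop accepts points for the under-represented (group, label) strata, so the parity gap shrinks while existing samples are untouched.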
Pages: 176-190
Number of pages: 15
Related papers
50 items in total
  • [31] Bias analysis in text classification for highly skewed data
    Tang, L
    Liu, H
    FIFTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2005, : 781 - 784
  • [32] Social-minded Measures of Data Quality: Fairness, Diversity, and Lack of Bias
    Pitoura, Evaggelia
    ACM JOURNAL OF DATA AND INFORMATION QUALITY, 2020, 12 (03):
  • [33] Algorithmic Bias: From Discrimination Discovery to Fairness-aware Data Mining
    Hajian, Sara
    Bonchi, Francesco
    Castillo, Carlos
    KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016, : 2125 - 2126
  • [34] JupyterLab in Retrograde: Contextual Notifications That Highlight Fairness and Bias Issues for Data Scientists
    Harrison, Galen
    Bryson, Kevin
    Bamba, Ahmad Emmanuel Balla
    Dovichi, Luca
    Binion, Aleksander Herrmann
    Borem, Arthur
    Ur, Blase
    PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2024), 2024,
  • [35] Bias and Precision of the "Multiple Imputation, Then Deletion" Method for Dealing With Missing Outcome Data
    Sullivan, Thomas R.
    Salter, Amy B.
    Ryan, Philip
    Lee, Katherine J.
    AMERICAN JOURNAL OF EPIDEMIOLOGY, 2015, 182 (06) : 528 - 534
  • [36] Binary Classification Optimisation with AI-Generated Data
    Mazon, Manuel Jesus Cerezo
    Garcia, Ricardo Moya
    Garcia, Ekaitz Arriola
    del Castillo, Miguel Herencia Garcia
    Iglesias, Guillermo
    TESTING SOFTWARE AND SYSTEMS, ICTSS 2024, 2025, 15383 : 210 - 216
  • [37] Understanding Demographic Bias and Representation in Social Media Health Data
    Cesare, Nina
    Grant, Christan
    Nsoesie, Elaine O.
    COMPANION OF THE 11TH ACM CONFERENCE ON WEB SCIENCE (WEBSCI' 19), 2019, : 7 - 9
  • [40] Social choice in distributed classification tasks: Dealing with vertically partitioned data
    Recamonde-Mendoza, Mariana
    Bazzan, Ana L. C.
    INFORMATION SCIENCES, 2016, 332 : 56 - 71