Automated Data Cleaning can Hurt Fairness in Machine Learning-Based Decision Making

Citations: 0
Authors
Guha, Shubha [1 ]
Khan, Falaah Arif [2 ]
Stoyanovich, Julia [2 ]
Schelter, Sebastian [1 ]
Affiliations
[1] Univ Amsterdam, NL-1012 WP Amsterdam, Netherlands
[2] NYU, New York, NY 10012 USA
Keywords
Cleaning; Data models; Data integrity; Task analysis; Production; Decision making; Medical services; Responsible data management; data cleaning; data preparation; fairness in machine learning
DOI
10.1109/TKDE.2024.3365524
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning - of the kind commonly used in production ML systems - impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions, envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
Pages: 7368-7379
Page count: 12