Automated Data Cleaning can Hurt Fairness in Machine Learning-Based Decision Making

Authors
Guha, Shubha [1]
Khan, Falaah Arif [2]
Stoyanovich, Julia [2]
Schelter, Sebastian [1]
Affiliations
[1] Univ Amsterdam, NL-1012 WP Amsterdam, Netherlands
[2] NYU, New York, NY 10012 USA
Keywords
Cleaning; Data models; Data integrity; Task analysis; Production; Decision making; Medical services; Responsible data management; data cleaning; data preparation; fairness in machine learning
DOI
10.1109/TKDE.2024.3365524
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning - of the kind commonly used in production ML systems - impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions, envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
Pages: 7368-7379
Number of pages: 12