Automated Data Cleaning can Hurt Fairness in Machine Learning-Based Decision Making

Authors
Guha, Shubha [1]
Khan, Falaah Arif [2]
Stoyanovich, Julia [2]
Schelter, Sebastian [1]
Affiliations
[1] Univ Amsterdam, NL-1012 WP Amsterdam, Netherlands
[2] NYU, New York, NY 10012 USA
Keywords
Cleaning; Data models; Data integrity; Task analysis; Production; Decision making; Medical services; Responsible data management; data cleaning; data preparation; fairness in machine learning
DOI
10.1109/TKDE.2024.3365524
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning - of the kind commonly used in production ML systems - impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions, envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
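The two questions raised in the abstract (do data quality issues track group membership, and does cleaning change fairness while leaving accuracy intact) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the helper names, the toy records, the "income" column, the two sets of predictions, and the choice of the true-positive-rate gap as the fairness metric are hypothetical and are not taken from the paper's code or results.

```python
from collections import defaultdict


def missing_rate_by_group(rows, column, group_key):
    """Fraction of rows per group whose `column` value is None."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row.get(column) is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


def true_positive_rate_gap(y_true, y_pred, groups, privileged):
    """TPR of the privileged group minus TPR of everyone else.

    A value of 0 means both groups have the same true positive rate.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        key = "priv" if group == privileged else "other"
        if truth == 1:
            positives[key] += 1
            if pred == 1:
                true_pos[key] += 1
    return true_pos["priv"] / positives["priv"] - true_pos["other"] / positives["other"]


def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy records (made up): group "b" has more missing values than group "a".
rows = [
    {"group": "a", "income": 50},
    {"group": "a", "income": 60},
    {"group": "a", "income": None},
    {"group": "a", "income": 55},
    {"group": "b", "income": None},
    {"group": "b", "income": None},
    {"group": "b", "income": 40},
    {"group": "b", "income": 45},
]
groups = [row["group"] for row in rows]
y_true = [1, 1, 0, 1, 1, 1, 0, 1]

# Hypothetical predictions from a model trained on dirty vs. cleaned data.
pred_dirty = [1, 1, 0, 0, 1, 1, 0, 0]
pred_cleaned = [1, 1, 0, 1, 1, 0, 0, 0]

rates = missing_rate_by_group(rows, "income", "group")
gap_dirty = true_positive_rate_gap(y_true, pred_dirty, groups, privileged="a")
gap_cleaned = true_positive_rate_gap(y_true, pred_cleaned, groups, privileged="a")

print("missing rate by group:", rates)
print("accuracy dirty/cleaned:", accuracy(y_true, pred_dirty), accuracy(y_true, pred_cleaned))
print("TPR gap dirty/cleaned:", gap_dirty, gap_cleaned)
```

In this made-up example both sets of predictions have the same accuracy, yet the gap in true positive rates between the groups is larger after cleaning. A different fairness metric or group definition could rank the two pipelines differently, which is the sensitivity the abstract describes.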
Pages: 7368-7379
Number of pages: 12