Retiring Adult: New Datasets for Fair Machine Learning

Cited by: 0
Authors
Ding, Frances [1]
Hardt, Moritz [1]
Miller, John [1]
Schmidt, Ludwig [2]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Toyota Res Inst, Toyota, Japan
Funding
US National Science Foundation;
Keywords
BIAS;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions.
Pages: 13