Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning

Cited by: 96
Authors
Mozafari, Barzan [1 ]
Sarkar, Purna [2 ]
Franklin, Michael [1 ,2 ]
Jordan, Michael [1 ]
Madden, Samuel [3 ]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] Univ Calif Berkeley, Berkeley, CA 94720 USA
[3] MIT, Cambridge, MA 02139 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 8 / Issue 02
DOI
10.14778/2735471.2735474
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1-2 orders of magnitude fewer questions than the baseline, and 4.5-44x fewer than existing active learning algorithms.
Pages: 125-136
Page count: 12
Related papers
50 in total
  • [21] Computational Models of Consumer Confidence from Large-Scale Online Attention Data: Crowd-Sourcing Econometrics
    Dong, Xianlei
    Bollen, Johan
    PLOS ONE, 2015, 10 (03):
  • [22] Trial2rev: Combining machine learning and crowd-sourcing to create a shared space for updating systematic reviews
    Martin, Paige
    Surian, Didi
    Bashir, Rabia
    Bourgeois, Florence T.
    Dunn, Adam G.
    JAMIA OPEN, 2019, 2 (01) : 15 - 22
  • [23] Measuring and modelling perceptions of the built environment for epidemiological research using crowd-sourcing and image-based deep learning models
    Andrew Larkin
    Ajay Krishna
    Lizhong Chen
    Ofer Amram
    Ally R. Avery
    Glen E. Duncan
    Perry Hystad
    Journal of Exposure Science & Environmental Epidemiology, 2022, 32 : 892 - 899
  • [24] Measuring and modelling perceptions of the built environment for epidemiological research using crowd-sourcing and image-based deep learning models
    Larkin, Andrew
    Krishna, Ajay
    Chen, Lizhong
    Amram, Ofer
    Avery, Ally R.
    Duncan, Glen E.
    Hystad, Perry
    JOURNAL OF EXPOSURE SCIENCE AND ENVIRONMENTAL EPIDEMIOLOGY, 2022, 32 (06) : 892 - 899
  • [25] Active learning in very large databases
    Panda, Navneet
    Goh, King-Shy
    Chang, Edward Y.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2006, 31 (03) : 249 - 267
  • [26] Active learning in very large databases
    Navneet Panda
    King-Shy Goh
    Edward Y. Chang
    Multimedia Tools and Applications, 2006, 31 : 249 - 267
  • [27] COVID-19 pandemic changes the recreational use of Moscow parks in space and time: Outcomes from crowd-sourcing and machine learning
    Matasov, Victor
    Vasenev, Viacheslav
    Matasov, Dmitrii
    Dvornikov, Yury
    Filyushkina, Anna
    Bubalo, Martina
    Nakhaev, Magomed
    Konstantinova, Anastasia
    URBAN FORESTRY & URBAN GREENING, 2023, 83
  • [28] Development and Pilot of Novel Process Using Machine Learning and Crowd-Sourcing to Conduct a Living Systematic Review of Rheumatoid Arthritis Drug Therapy
    Lee, Chloe
    Thomas, Megan
    Whittle, Samuel
    Buchbinder, Rachelle
    Kamso, Mohammed
    Pardo, Jordi
    Hazlewood, Glen
    JOURNAL OF RHEUMATOLOGY, 2020, 47 (07) : 1068 - 1068
  • [29] ACTIVE LEARNING ON LARGE HYPERSPECTRAL DATASETS: A PREPROCESSING METHOD
    Thoreau, R.
    Achard, V
    Risser, L.
    Berthelot, B.
    Briottet, X.
    XXIV ISPRS CONGRESS: IMAGING TODAY, FORESEEING TOMORROW, COMMISSION III, 2022, 43-B3 : 435 - 442
  • [30] ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization
    Yin, John
    Zhang, Chao
    Mirarab, Siavash
    BIOINFORMATICS, 2019, 35 (20) : 3961 - 3969