Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning

Cited by: 96

Authors:

Mozafari, Barzan ^{[1]}

Sarkar, Purna ^{[2]}

Franklin, Michael ^{[1,2]}

Jordan, Michael ^{[1]}

Madden, Samuel ^{[3]}

Affiliations:

[1] Univ Michigan, Ann Arbor, MI 48109 USA

[2] Univ Calif Berkeley, Berkeley, CA 94720 USA

[3] MIT, Cambridge, MA 02139 USA

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 8 / Issue 02

DOI:

10.14778/2735471.2735474

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1-2 orders of magnitude fewer questions than the baseline, and 4.5-44x fewer than existing active learning algorithms.
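
As an illustration of the general idea in the abstract (using bootstrap resampling to decide which items are worth sending to the crowd), here is a minimal sketch. It is not the paper's algorithm: the nearest-centroid classifier, the vote-variance score p(1-p), and every name in the code (`train_centroid`, `bootstrap_uncertainty`, `select_for_crowd`, `budget`, `k`) are assumptions made for this example only.

```python
import random
from statistics import mean


def train_centroid(points):
    """Fit a toy nearest-centroid classifier on (x, label) pairs, labels 0/1.

    Assumption: this stands in for any black-box classifier.
    """
    c0 = [x for x, y in points if y == 0]
    c1 = [x for x, y in points if y == 1]
    if not c0 or not c1:
        # A resample may contain a single class; then always predict it.
        only = 0 if c0 else 1
        return lambda x: only
    m0, m1 = mean(c0), mean(c1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1


def bootstrap_uncertainty(labeled, unlabeled, k=50, seed=0):
    """Score each unlabeled item by the variance of bootstrap votes.

    Trains k classifiers on resamples (with replacement) of the labeled
    data; an item whose predictions flip between resamples scores high.
    """
    rng = random.Random(seed)
    votes = [0] * len(unlabeled)
    for _ in range(k):
        resample = [rng.choice(labeled) for _ in labeled]
        clf = train_centroid(resample)
        for i, x in enumerate(unlabeled):
            votes[i] += clf(x)
    # p is the fraction of classifiers voting for class 1.
    return [(v / k) * (1 - v / k) for v in votes]


def select_for_crowd(labeled, unlabeled, budget, k=50, seed=0):
    """Return the `budget` most uncertain items to ask the crowd about."""
    scores = bootstrap_uncertainty(labeled, unlabeled, k, seed)
    ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
    return [unlabeled[i] for i in ranked[:budget]]


if __name__ == "__main__":
    labeled = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
    unlabeled = [-5.0, 5.1, 15.0]
    print(select_for_crowd(labeled, unlabeled, budget=1))
```

In this toy setup the item near the class boundary is the one whose label the bootstrap classifiers disagree on, so it is the one worth a crowd question; items far from the boundary can be left to the classifier.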

Pages: 125-136

Page count: 12
