Machine Learning (ML) approaches are widely used classification and regression methods for data mining applications. However, the time-consuming training process greatly limits their efficiency. In this paper we use SVM (a traditional ML algorithm) and DNN (a state-of-the-art ML algorithm) as examples to illustrate the idea. For SVM, a major performance bottleneck of current tools is that they use a single, fixed data storage format, even though the data format can have a significant influence on storage and computation complexity, memory bandwidth, and the efficiency of parallel processing. To address this problem, we study the factors influencing the algorithm's performance and conduct auto-tuning to speed up SVM training. DNN training is even slower than SVM training. For example, training the AlexNet model on the CIFAR-10 dataset with an 8-core CPU takes 8.2 hours. CIFAR-10 is only 170 MB, which is too small for distributed processing to be efficient. Moreover, due to algorithmic limitations, only a small batch of data can be processed at each iteration. We therefore focus on finding the right algorithmic parameters and using auto-tuning techniques to make the algorithm run faster. For SVM training, our implementation achieves a 1.7-16.3x speedup (6.8x on average) over the non-adaptive case (using the worst data format) across various datasets. For DNN training on the CIFAR-10 dataset, we reduce the training time from 8.2 hours to roughly 1 minute. We use a dollars-per-speedup benchmark to help users select the right deep learning hardware.
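
To make the data-format idea concrete, the following is a minimal sketch, not the paper's actual implementation, of adaptively choosing between a dense array and a compressed sparse row (CSR) representation based on dataset density before SVM training. The `choose_format` helper and the `0.3` density threshold are illustrative assumptions; a real auto-tuner would select the threshold from measured performance.

```python
# Minimal sketch of adaptive data-format selection for SVM-style workloads.
# The helper name and the 0.3 density threshold are illustrative assumptions,
# not the tuned policy described in the paper.
import numpy as np
from scipy import sparse


def choose_format(X, density_threshold=0.3):
    """Return X in CSR form if it is sparse enough, otherwise as a dense array."""
    density = np.count_nonzero(X) / X.size
    if density < density_threshold:
        # Sparse format: lower storage cost and less memory traffic for sparse data.
        return sparse.csr_matrix(X)
    # Dense format: contiguous layout gives better memory-bandwidth utilization.
    return np.ascontiguousarray(X)


# Example: a mostly-zero matrix is converted to CSR before training.
rng = np.random.default_rng(0)
X = np.zeros((1000, 500))
X[rng.integers(0, 1000, 5000), rng.integers(0, 500, 5000)] = 1.0
print(type(choose_format(X)))  # prints a scipy CSR matrix type
```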