Runtime Data Layout Scheduling for Machine Learning Dataset

Cited by: 5
Authors
You, Yang [1]
Demmel, James [1]
Affiliations
[1] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
Keywords
parallel auto-tuning; machine learning;
DOI
10.1109/ICPP.2017.54
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Machine Learning (ML) approaches are widely used classification/regression methods for data mining applications. However, the time-consuming training process greatly limits the efficiency of ML approaches. We use SVM (a traditional ML algorithm) and DNN (a state-of-the-art ML algorithm) as examples to illustrate the idea in this paper. For SVM, a major performance bottleneck of current tools is their use of a unified data storage format: the data format can significantly influence the complexity of storage and computation, memory bandwidth, and the efficiency of parallel processing. To address this problem, we study the factors influencing the algorithm's performance and conduct auto-tuning to speed up SVM training. DNN training is even slower than SVM training. For example, training the AlexNet model on the CIFAR-10 dataset with an 8-core CPU takes 8.2 hours. CIFAR-10 is only 170 MB, so distributed processing is not efficient for it. Moreover, because of algorithmic limitations, only a small batch of data can be processed at each iteration. We focus on finding the right algorithmic parameters and using auto-tuning techniques to make the algorithm run faster. For SVM training, our implementation achieves a 1.7-16.3x speedup (6.8x on average) over the non-adaptive case (using the worst data format) across various datasets. For DNN training on the CIFAR-10 dataset, we reduce the time from 8.2 hours to roughly 1 minute. We use dollars per speedup as a benchmark to help users select the right deep learning hardware.
Pages: 452-461
Page count: 10