Runtime Data Layout Scheduling for Machine Learning Dataset

Cited by: 5
Authors
You, Yang [1]
Demmel, James [1]
Affiliations
[1] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
Keywords
parallel auto-tuning; machine learning
DOI
10.1109/ICPP.2017.54
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Machine learning (ML) approaches are widely used classification and regression methods for data mining applications. However, the time-consuming training process greatly limits their efficiency. This paper uses SVM (a traditional ML algorithm) and DNN (a state-of-the-art ML algorithm) to illustrate the idea. For SVM, a major performance bottleneck of current tools is that they use a single, unified data storage format, even though the data format can significantly influence storage and computation complexity, memory bandwidth, and the efficiency of parallel processing. To address this problem, we study the factors that influence the algorithm's performance and apply auto-tuning to speed up SVM training. DNN training is even slower than SVM training. For example, training the AlexNet model on the CIFAR-10 dataset with an 8-core CPU takes 8.2 hours. CIFAR-10 is only 170 MB, which is too small for distributed processing to be efficient. Moreover, because of a limitation of the algorithm, only a small batch of data can be processed in each iteration. We focus on finding the right algorithmic parameters and using auto-tuning techniques to make the algorithm run faster. For SVM training, our implementation achieves a 1.7-16.3x speedup (6.8x on average) over the non-adaptive case (using the worst data format) on various datasets. For DNN training on the CIFAR-10 dataset, we reduce the time from 8.2 hours to roughly 1 minute. We use dollars per speedup as a benchmark to help users select the right deep learning hardware.
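The idea of choosing a data format at runtime can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only two candidate layouts (dense rows and a CSR-style sparse layout), a linear kernel as the representative SVM workload, and a simple timing probe as the tuning criterion. All function names (`to_dense`, `to_csr`, `kernel_row_dense`, `kernel_row_csr`, `choose_layout`, `random_sparse_rows`) are hypothetical.

```python
import random
import time


def random_sparse_rows(n_rows, n_features, density, seed=0):
    """Synthetic dataset: each row is a {column: value} dict (assumption)."""
    rng = random.Random(seed)
    return [
        {j: rng.random() for j in range(n_features) if rng.random() < density}
        for _ in range(n_rows)
    ]


def to_dense(rows, n_features):
    """Dense layout: one full-length list per sample."""
    return [[row.get(j, 0.0) for j in range(n_features)] for row in rows]


def to_csr(rows):
    """CSR-style layout: values, column indices, and row pointers."""
    values, cols, ptr = [], [], [0]
    for row in rows:
        for j in sorted(row):
            cols.append(j)
            values.append(row[j])
        ptr.append(len(values))
    return values, cols, ptr


def kernel_row_dense(dense, i):
    """Linear-kernel row K[i, :] computed on the dense layout."""
    xi = dense[i]
    return [sum(a * b for a, b in zip(xi, xk)) for xk in dense]


def kernel_row_csr(csr, i, n_features):
    """Linear-kernel row K[i, :] computed on the CSR-style layout."""
    values, cols, ptr = csr
    # Scatter sample i into a dense work vector, then touch only nonzeros.
    work = [0.0] * n_features
    for p in range(ptr[i], ptr[i + 1]):
        work[cols[p]] = values[p]
    return [
        sum(values[p] * work[cols[p]] for p in range(ptr[k], ptr[k + 1]))
        for k in range(len(ptr) - 1)
    ]


def choose_layout(rows, n_features, probes=3):
    """Time a few kernel rows per layout and return the fastest one."""
    dense = to_dense(rows, n_features)
    csr = to_csr(rows)
    candidates = {
        "dense": lambda i: kernel_row_dense(dense, i),
        "csr": lambda i: kernel_row_csr(csr, i, n_features),
    }
    timings = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        for i in range(min(probes, len(rows))):
            fn(i)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    rows = random_sparse_rows(n_rows=200, n_features=500, density=0.02)
    best, timings = choose_layout(rows, n_features=500)
    print("selected layout:", best, timings)
```

The probe runs only a few kernel rows, so the tuning cost stays small relative to a full training run. Which layout wins depends on the dataset's density and the machine, which is the reason for deciding at runtime rather than fixing one format in advance.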
Pages: 452-461
Page count: 10
Related papers
50 in total
  • [11] Exploration of Machine Learning and Data Mining techniques on a horse racing dataset
    Kyriacou, E
    Toolan, F
    Dunnion, J
    MLMTA '05: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MACHINE LEARNING MODELS TECHNOLOGIES AND APPLICATIONS, 2005, : 161 - 166
  • [12] BIKED: A DATASET AND MACHINE LEARNING BENCHMARKS FOR DATA-DRIVEN BICYCLE DESIGN
    Regenwetter, Lyle
    Curry, Brent
    Ahmed, Faez
    PROCEEDINGS OF ASME 2021 INTERNATIONAL DESIGN ENGINEERING TECHNICAL CONFERENCES AND COMPUTERS AND INFORMATION IN ENGINEERING CONFERENCE, IDETC-CIE2021, VOL 3A, 2021,
  • [13] Exploring the use of machine learning techniques and synthetic data creation with CoCoBi dataset
    Pihlajamaki, Mika
    Silander, Kaisa
    Kantojarvi, Katri
    Eklund, Niina
    Wahlfors, Tiina
    EUROPEAN JOURNAL OF HUMAN GENETICS, 2024, 32 : 677 - 677
  • [14] TRAINING DATASET FOR THE MACHINE LEARNING APPROACH IN GLACIER MONITORING APPLYING SAR DATA
    Piwowar, Lukasz
    Lucka, Magdalena
    Witkowski, Wojciech
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 191 - 194
  • [15] Training Dataset for the Machine Learning Approach in Glacier Monitoring Applying Sar Data
    Piwowar, Lukasz
    Lucka, Magdalena
    Witkowski, Wojciech
    International Geoscience and Remote Sensing Symposium (IGARSS), 2023, 2023-July : 191 - 194
  • [16] Citizens' data afterlives: Practices of dataset inclusion in machine learning for public welfare
    Ratner, Helene Friis
    Thylstrup, Nanna Bonde
    AI & SOCIETY, 2024, 40 (3) : 1183 - 1193
  • [17] Applying Machine Learning to Predict Film Daily Audience Data: System and Dataset
    Jiang, Luyao
    Hao, Yu
    2020 3RD INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA (ICAIBD 2020), 2020, : 11 - 16
  • [18] RMLIM: A Runtime Machine Learning Based Identification Model for Approximate Computing on Data Flow Graphs
    Wang, Ye
    Dong, Jian
    Liu, Yanxin
    Wang, Chunpei
    Qu, Gang
    IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, 2022, 7 (01) : 201 - 210
  • [19] A survey on dataset quality in machine learning
    Gong, Youdi
    Liu, Guangzhen
    Xue, Yunzhi
    Li, Rui
    Meng, Lingzhong
    INFORMATION AND SOFTWARE TECHNOLOGY, 2023, 162
  • [20] A benchmark dataset for machine learning in ecotoxicology
    Christoph Schür
    Lilian Gasser
    Fernando Perez-Cruz
    Kristin Schirmer
    Marco Baity-Jesi
    Scientific Data, 10