Optimizing Machine Learning on Apache Spark in HPC Environments

Cited: 0
Authors
Li, Zhenyu [1 ]
Davis, James [1 ]
Jarvis, Stephen A. [1 ]
Affiliations
[1] Univ Warwick, Dept Comp Sci, Coventry, W Midlands, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Machine Learning; High Performance Computing; Apache Spark; All-Reduce; Asynchronous Stochastic Gradient Descent; MAPREDUCE;
DOI
10.1109/MLHPC.2018.00006
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Machine learning has established itself as a powerful tool for constructing decision-making models and algorithms through the use of statistical techniques on training data. However, a significant impediment to its progress is the time spent training and improving the accuracy of these models: this is a data- and compute-intensive process, which can often take days, weeks or even months to complete. A common approach to accelerating this process is to employ multiple machines simultaneously, a trait shared with the field of High Performance Computing (HPC) and its clusters. However, existing distributed frameworks for data analytics and machine learning are designed for commodity servers; they do not realize the full potential of an HPC cluster, and thus deny the effective use of a readily available and potentially useful resource. In this work we adapt Apache Spark, a distributed data-flow framework, to support machine learning in HPC environments. There are inherent challenges to using Spark in this context: memory management, communication costs and synchronization overheads all pose challenges to its efficiency. To this end we introduce: (i) MapRDD, a fine-grained distributed data representation; (ii) a task-based all-reduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent (SGD) algorithm using non-blocking all-reduce. We demonstrate up to a 2.6x overall speedup (or an 11.2x theoretical speedup with an NVIDIA K80 graphics card), an 82-91% compute ratio, and an 80% reduction in memory usage when training the GoogLeNet model to classify 10% of the ImageNet dataset on a 32-node cluster. We also demonstrate a convergence rate for the new asynchronous SGD comparable to that of the synchronous method.
With the increasing use of accelerator cards, larger clusters and deeper neural network models, we predict that a further 2x speedup (i.e. a 22.4x cumulative speedup) is obtainable with the new asynchronous SGD algorithm on heterogeneous clusters.
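The core idea behind asynchronous SGD with non-blocking all-reduce is that the gradient reduction started at step t completes while the workers already compute step t+1, so each update is applied one step late (with slightly stale gradients) in exchange for overlapping communication with computation. The following is a minimal single-process sketch of that staleness pattern, not the paper's implementation: `allreduce_mean` is a hypothetical stand-in for a non-blocking all-reduce, and the quadratic per-worker loss is purely illustrative.

```python
import numpy as np

def allreduce_mean(grads):
    """Stand-in for a non-blocking all-reduce: averages per-worker gradients.

    In a real cluster this call would return immediately and the reduction
    would complete in the background while the next step is computed.
    """
    return sum(grads) / len(grads)

def async_sgd(workers_data, w0, lr=0.1, steps=50):
    """One-step-stale SGD: the update applied at step t uses the gradient
    that was 'in flight' (being all-reduced) during step t-1."""
    w = w0.copy()
    pending = None  # gradient still being reduced from the previous step
    for _ in range(steps):
        # Each worker computes a local gradient of f(w) = 0.5 * ||w - x||^2
        # on the *current* parameters, before the stale update is applied.
        local_grads = [w - x for x in workers_data]
        if pending is not None:
            # Apply the gradient whose reduction overlapped the compute above.
            w -= lr * pending
        # "Launch" the non-blocking all-reduce for this step's gradients.
        pending = allreduce_mean(local_grads)
    if pending is not None:
        w -= lr * pending  # drain the last in-flight gradient
    return w
```

With two simulated workers holding data points 1.0 and 3.0, the parameter converges toward their mean (2.0) despite every update being one step stale, illustrating why the abstract's asynchronous SGD can match the synchronous convergence rate for small staleness.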
Pages: 95-105 (11 pages)