Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes in order to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by reducing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
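The hybrid aggregation idea in the abstract can be sketched as follows: workers inside a homogeneous group average their gradients with an all-reduce, and each group then pushes its averaged gradient asynchronously to a shared parameter server. This is a minimal illustrative sketch of that scheme, not the authors' implementation; all names (`allreduce_mean`, `ParameterServer`) and values are hypothetical, and the real framework performs the all-reduce over MPI rather than in-process.

```python
def allreduce_mean(grads):
    """Average gradients element-wise across workers in one homogeneous group
    (stands in for an MPI all-reduce within the group)."""
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]

class ParameterServer:
    """Applies group-level gradients to the global model as they arrive."""
    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def push(self, grad):
        # Asynchronous update: each group applies its averaged gradient
        # on arrival, without waiting for slower groups.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]

# Two groups of different sizes, mimicking a heterogeneous cluster:
ps = ParameterServer(params=[1.0, -2.0])
fast_group = [[0.2, 0.4], [0.0, 0.2]]   # gradients from 2 fast GPUs
slow_group = [[0.3, 0.3]]               # gradient from 1 slow GPU
ps.push(allreduce_mean(fast_group))     # fast group reports first
ps.push(allreduce_mean(slow_group))     # slow group reports later
# global parameters after both updates: approximately [0.96, -2.06]
```

Grouping workers by hardware speed keeps the synchronous all-reduce confined to devices that finish at the same time, while the asynchronous push to the parameter server prevents slow groups from stalling fast ones.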
Pages: 2287-2300
Number of pages: 14
Related papers
50 records in total
  • [31] MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
    Han, Jingoo
    Rafique, M. Mustafa
    Xu, Luna
    Butt, Ali R.
    Lim, Seung-Hwan
    Vazhkudai, Sudharshan S.
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 272 - 281
  • [32] Analysis of large deviations behavior of multi-GPU memory access in deep learning
    Tamizharasan, P. S.
    Ramasubramanian, N.
    JOURNAL OF SUPERCOMPUTING, 2018, 74 (05): : 2199 - 2212
  • [34] CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers
    Koliousis, Alexandros
    Watcharapichat, Pijika
    Weidlich, Matthias
    Mai, Luo
    Costa, Paolo
    Pietzuch, Peter
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (11): : 1399 - 1413
  • [35] A Distributed Multi-GPU System for Fast Graph Processing
    Jia, Zhihao
    Kwon, Yongkee
    Shipman, Galen
    McCormick, Pat
    Erez, Mattan
    Aiken, Alex
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2017, 11 (03): : 297 - 310
  • [36] Towards Energy-Efficient Real-Time Scheduling of Heterogeneous Multi-GPU Systems
    Wang, Yidi
    Karimi, Mohsen
    Kim, Hyoseung
    2022 IEEE 43RD REAL-TIME SYSTEMS SYMPOSIUM (RTSS 2022), 2022, : 409 - 421
  • [37] Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster
    Campos, Victor
    Sastre, Francesc
    Yagues, Maurici
    Bellver, Miriam
    Giro-i-Nieto, Xavier
    Torres, Jordi
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2017), 2017, 108 : 315 - 324
  • [38] Distributed Join Algorithms on Multi-GPU Clusters with GPUDirect RDMA
    Guo, Chengxin
    Chen, Hong
    Zhang, Feng
    Li, Cuiping
    PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,
  • [39] Acceleration of 3D ECT image reconstruction in heterogeneous, multi-GPU, multi-node distributed system
    Majchrowicz, Michal
    Kapusta, Pawel
    Jackowska-Strumillo, Lidia
    Sankowski, Dominik
    PROCEEDINGS OF THE 2018 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), 2018, : 347 - 350
  • [40] HPSM: A Programming Framework for Multi-CPU and Multi-GPU Systems
    Lima, Joao V. F.
    Di Domenico, Daniel
    2017 INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING WORKSHOPS (SBAC-PADW), 2017, : 31 - 36