Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning;
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes in order to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by reducing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
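The hybrid aggregation idea in the abstract can be sketched as follows: workers inside a homogeneous group average their gradients with an all-reduce, and each group then pushes its averaged gradient asynchronously to a shared parameter server. This is a minimal illustrative sketch of that scheme, not the authors' implementation; all names (`allreduce_mean`, `ParameterServer`) and values are hypothetical, and the real framework performs the all-reduce over MPI rather than in-process.

```python
def allreduce_mean(grads):
    """Average gradients element-wise across workers in one homogeneous group
    (stands in for an MPI all-reduce within the group)."""
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]

class ParameterServer:
    """Applies group-level gradients to the global model as they arrive."""
    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def push(self, grad):
        # Asynchronous update: each group applies its averaged gradient
        # on arrival, without waiting for slower groups.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]

# Two groups of different sizes, mimicking a heterogeneous cluster:
ps = ParameterServer(params=[1.0, -2.0])
fast_group = [[0.2, 0.4], [0.0, 0.2]]   # gradients from 2 fast GPUs
slow_group = [[0.3, 0.3]]               # gradient from 1 slow GPU
ps.push(allreduce_mean(fast_group))     # fast group reports first
ps.push(allreduce_mean(slow_group))     # slow group reports later
# global parameters after both updates: approximately [0.96, -2.06]
```

Grouping workers by hardware speed keeps the synchronous all-reduce confined to devices that finish at the same time, while the asynchronous push to the parameter server prevents slow groups from stalling fast ones.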
Pages: 2287-2300
Number of pages: 14
Related papers
50 records in total
  • [31] MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
    Han, Jingoo
    Rafique, M. Mustafa
    Xu, Luna
    Butt, Ali R.
    Lim, Seung-Hwan
    Vazhkudai, Sudharshan S.
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 272 - 281
  • [32] Analysis of large deviations behavior of multi-GPU memory access in deep learning
    Tamizharasan, P. S.
    Ramasubramanian, N.
    JOURNAL OF SUPERCOMPUTING, 2018, 74 (05): : 2199 - 2212
  • [34] CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers
    Koliousis, Alexandros
    Watcharapichat, Pijika
    Weidlich, Matthias
    Mai, Luo
    Costa, Paolo
    Pietzuch, Peter
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (11): : 1399 - 1413
  • [35] A Distributed Multi-GPU System for Fast Graph Processing
    Jia, Zhihao
    Kwon, Yongkee
    Shipman, Galen
    McCormick, Pat
    Erez, Mattan
    Aiken, Alex
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2017, 11 (03): : 297 - 310
  • [36] Towards Energy-Efficient Real-Time Scheduling of Heterogeneous Multi-GPU Systems
    Wang, Yidi
    Karimi, Mohsen
    Kim, Hyoseung
    2022 IEEE 43RD REAL-TIME SYSTEMS SYMPOSIUM (RTSS 2022), 2022, : 409 - 421
  • [37] Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster
    Campos, Victor
    Sastre, Francesc
    Yagues, Maurici
    Bellver, Miriam
    Giro-i-Nieto, Xavier
    Torres, Jordi
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2017), 2017, 108 : 315 - 324
  • [38] Distributed Join Algorithms on Multi-GPU Clusters with GPUDirect RDMA
    Guo, Chengxin
    Chen, Hong
    Zhang, Feng
    Li, Cuiping
    PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,
  • [39] Acceleration of 3D ECT image reconstruction in heterogeneous, multi-GPU, multi-node distributed system
    Majchrowicz, Michal
    Kapusta, Pawel
    Jackowska-Strumillo, Lidia
    Sankowski, Dominik
    PROCEEDINGS OF THE 2018 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), 2018, : 347 - 350
  • [40] HPSM: A Programming Framework for Multi-CPU and Multi-GPU Systems
    Lima, Joao V. F.
    Di Domenico, Daniel
    2017 INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING WORKSHOPS (SBAC-PADW), 2017, : 31 - 36