Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

Cited by: 1
Authors
Liu, Ting [1 ,2 ]
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,3 ]
Li, Zhenyu [1 ,3 ]
He, Guangxin [1 ,2 ]
Wu, Jiaoren [4 ]
Zhang, Shengzhuo [4 ]
Yang, Xingwu [4 ]
Tyson, Gareth [5 ,6 ]
Xie, Gaogang [2 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Purple Mt Labs, Nanjing, Peoples R China
[4] Kuaishou, Beijing, Peoples R China
[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[6] Queen Mary Univ London, London, England
[7] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
distributed deep learning; scaling performance; performance modeling; tensor fusion; communication;
DOI
10.1145/3485447.3511981
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with the other workers, which introduces communication overhead and degrades scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF achieves a low estimation error, ranging from 0.5% to 8.4% across different models. Using OSF, we identify the factors that degrade scaling performance and propose solutions to effectively mitigate their impact. In particular, the proposed adaptive tensor fusion improves scaling performance by 32.2%~150% compared to using a constant tensor fusion buffer size.
Pages: 1764-1773
Page count: 10
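
The abstract describes OSF only at a high level. As a hedged illustration of the two mechanisms it names (overlap of communication with computation, and tensor fusion for batching gradient updates), the sketch below estimates per-iteration time with byte-threshold fusion buffers and a ring all-reduce cost model. This is not the paper's OSF formulation: the cost formulas, the fusion policy, and every parameter value are assumptions made purely for illustration.

```python
# Minimal sketch (NOT the paper's OSF model): a back-of-the-envelope
# per-iteration time estimate capturing the two effects the abstract names --
# overlap of communication with backprop, and tensor fusion. The ring
# all-reduce cost model, the byte-threshold fusion policy, and all parameter
# values below are illustrative assumptions.

def allreduce_time(nbytes, workers, bandwidth, latency):
    """Ring all-reduce: 2*(n-1) steps, each paying a latency term, and
    2*(n-1)/n of the payload crossing each link (bandwidth in bytes/s)."""
    steps = 2 * (workers - 1)
    return steps * latency + (2 * (workers - 1) / workers) * nbytes / bandwidth

def iteration_time(layer_grad_bytes, layer_backprop_s, workers,
                   bandwidth, latency, fusion_buffer_bytes):
    """Gradients produced layer by layer during backprop are batched into
    fusion buffers; each buffer's all-reduce overlaps with the remaining
    backprop computation, but buffers are reduced one at a time."""
    compute_done = 0.0   # time at which backprop has produced the current buffer
    comm_done = 0.0      # time at which the previous all-reduce finished
    buffered = 0.0
    for grad_bytes, bp_s in zip(layer_grad_bytes, layer_backprop_s):
        compute_done += bp_s
        buffered += grad_bytes
        if buffered >= fusion_buffer_bytes:
            start = max(compute_done, comm_done)   # wait for data and the link
            comm_done = start + allreduce_time(buffered, workers, bandwidth, latency)
            buffered = 0.0
    if buffered:                                   # flush the last partial buffer
        start = max(compute_done, comm_done)
        comm_done = start + allreduce_time(buffered, workers, bandwidth, latency)
    return max(compute_done, comm_done)

if __name__ == "__main__":
    grads = [8e6] * 50        # 50 layers, 8 MB of gradients each
    backprop = [2e-3] * 50    # 2 ms of backprop per layer
    t_ideal = sum(backprop)   # per-worker compute time with no communication
    t_ddl = iteration_time(grads, backprop, workers=8,
                           bandwidth=10e9, latency=20e-6,
                           fusion_buffer_bytes=64e6)
    # One illustrative definition of a scaling factor: the fraction of the
    # ideal (communication-free) speed that the distributed iteration achieves.
    print(f"estimated scaling factor ~ {t_ideal / t_ddl:.2f}")
```

With these made-up numbers most of the communication hides behind backprop and only the final buffers stay exposed, which is exactly where the choice of the (here hypothetical) fusion_buffer_bytes threshold matters: larger buffers amortize per-message latency, smaller ones shrink the un-overlapped tail.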