Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

Cited by: 1
Authors
Liu, Ting [1 ,2 ]
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,3 ]
Li, Zhenyu [1 ,3 ]
He, Guangxin [1 ,2 ]
Wu, Jiaoren [4 ]
Zhang, Shengzhuo [4 ]
Yang, Xingwu [4 ]
Tyson, Gareth [5 ,6 ]
Xie, Gaogang [2 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Purple Mt Labs, Nanjing, Peoples R China
[4] Kuaishou, Beijing, Peoples R China
[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[6] Queen Mary Univ London, London, England
[7] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
distributed deep learning; scaling performance; performance modeling; tensor fusion; COMMUNICATION;
DOI
10.1145/3485447.3511981
Chinese Library Classification (CLC)
TP3 [computing technology; computer technology]
Discipline classification code
0812
Abstract
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with other workers. This introduces communication overhead and degrades the scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models, given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and the tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF obtains a low estimation error (ranging from 0.5% to 8.4% for different models). Using OSF, we identify the factors that degrade the scaling performance, and propose solutions to effectively mitigate their impacts. Specifically, the proposed adaptive tensor fusion improves the scaling performance by 32.2%~150% compared to using a constant tensor fusion buffer size.
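The two effects named in the abstract (computation-communication overlap and tensor fusion) can be illustrated with a toy per-iteration time estimate. The Python sketch below is not the paper's OSF model; the parameter names (t_fwd, t_bwd, grad_bytes, bandwidth, latency, fusion_bytes) and the simple max-based overlap rule are assumptions chosen purely for illustration.

# Toy sketch (assumed, not the paper's OSF formula): estimate one iteration's
# wall time given computation times, per-layer gradient sizes, and a tensor
# fusion buffer that batches small gradients into larger messages.

def iteration_time(t_fwd, t_bwd, grad_bytes, bandwidth, latency, fusion_bytes):
    """Return an estimated iteration time in seconds (illustrative only)."""
    # Group per-layer gradient sizes into fused buffers of at most fusion_bytes,
    # so each fused buffer pays the per-message latency only once.
    buffers, current = [], 0
    for size in grad_bytes:
        if current and current + size > fusion_bytes:
            buffers.append(current)
            current = 0
        current += size
    if current:
        buffers.append(current)

    # Communication time: per-buffer latency plus bytes over link bandwidth.
    t_comm = sum(latency + b / bandwidth for b in buffers)

    # Assume communication overlaps only with backward computation; whatever
    # does not fit under t_bwd is exposed on the critical path.
    exposed_comm = max(0.0, t_comm - t_bwd)
    return t_fwd + t_bwd + exposed_comm


# Example: 50 gradient tensors of 4 MB each, a 10 GB/s link, 64 MB fusion buffer.
t = iteration_time(t_fwd=0.030, t_bwd=0.060,
                   grad_bytes=[4e6] * 50, bandwidth=10e9,
                   latency=50e-6, fusion_bytes=64e6)
print(f"estimated iteration time: {t * 1e3:.2f} ms")

With these toy numbers most of the communication is hidden under the backward pass; shrinking fusion_bytes raises the per-message latency cost, while growing it delays when the first buffer can start transferring. That trade-off is the kind of effect the paper's adaptive tensor fusion is designed to balance.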
Pages: 1764-1773
Page count: 10