Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

Cited by: 1
Authors
Liu, Ting [1 ,2 ]
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,3 ]
Li, Zhenyu [1 ,3 ]
He, Guangxin [1 ,2 ]
Wu, Jiaoren [4 ]
Zhang, Shengzhuo [4 ]
Yang, Xingwu [4 ]
Tyson, Gareth [5 ,6 ]
Xie, Gaogang [2 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Purple Mt Labs, Nanjing, Peoples R China
[4] Kuaishou, Beijing, Peoples R China
[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[6] Queen Mary Univ London, London, England
[7] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
distributed deep learning; scaling performance; performance modeling; tensor fusion; COMMUNICATION;
DOI
10.1145/3485447.3511981
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with other workers. This introduces communication overhead and degrades the scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models, given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and the tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF obtains a low estimation error (ranging from 0.5% to 8.4% for different models). Using OSF, we identify the factors that degrade the scaling performance and propose solutions to effectively mitigate their impacts. Specifically, the proposed adaptive tensor fusion improves the scaling performance by 32.2%~150% compared to a constant tensor fusion buffer size.
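To make the abstract's idea concrete, the following minimal Python sketch estimates a scaling factor from a simplified per-iteration cost model with computation/communication overlap and tensor fusion. It is not the paper's recursive OSF formulation; all names, parameters (e.g., overlap_ratio, fusion_buffer_bytes), and the example numbers are hypothetical assumptions for illustration only.

    # Illustrative sketch only: a simplified scaling-factor estimate with
    # computation/communication overlap and tensor fusion. Not the OSF model.

    def estimate_scaling_factor(t_compute, tensor_sizes, bandwidth,
                                fusion_buffer_bytes, overlap_ratio=0.7):
        """Return an estimated scaling factor in (0, 1].

        t_compute           -- forward+backward computation time per iteration (s)
        tensor_sizes        -- gradient tensor sizes in bytes, in backprop order
        bandwidth           -- effective all-reduce bandwidth per worker (bytes/s)
        fusion_buffer_bytes -- tensor fusion buffer size (bytes)
        overlap_ratio       -- assumed fraction of communication hidden by compute
        """
        # Batch gradients into fusion buffers, mimicking tensor fusion.
        buffers, current = [], 0
        for size in tensor_sizes:
            current += size
            if current >= fusion_buffer_bytes:
                buffers.append(current)
                current = 0
        if current:
            buffers.append(current)

        # Total communication time for all fused buffers.
        t_comm = sum(buffers) / bandwidth

        # Only the non-overlapped part of communication extends the iteration.
        t_exposed = max(0.0, t_comm * (1.0 - overlap_ratio))
        t_iter = t_compute + t_exposed

        # 1.0 means perfect scaling (communication fully hidden).
        return t_compute / t_iter

    if __name__ == "__main__":
        # Hypothetical example: 100 gradient tensors of 4 MB each,
        # 25 GB/s effective bandwidth, 64 MB fusion buffer.
        sizes = [4 * 2**20] * 100
        print(estimate_scaling_factor(t_compute=0.25, tensor_sizes=sizes,
                                      bandwidth=25e9,
                                      fusion_buffer_bytes=64 * 2**20))

Under this toy model, enlarging the fusion buffer reduces the number of all-reduce launches but delays when communication can start, which is the trade-off the paper's adaptive tensor fusion targets.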
Pages: 1764 - 1773
Number of pages: 10
Related Papers
50 in total
  • [41] Distributed Framework for Accelerating Training of Deep Learning Models through Prioritization
    Zhou, Tian
    Gao, Lixin
    2021 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E 2021, 2021, : 201 - 209
  • [42] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation
    Pan, Rui
    Lei, Yiming
    Li, Jialong
    Xie, Zhiqiang
    Yuan, Binhang
    Xia, Yiting
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 93 - 100
  • [43] Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
    Teng, Yunfei
    Gao, Wenbo
    Chalus, Francois
    Choromanska, Anna
    Goldfarb, Donald
    Weller, Adrian
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [44] RedSync: Reducing synchronization bandwidth for distributed deep learning training system
    Fang, Jiarui
    Fu, Haohuan
    Yang, Guangwen
    Hsieh, Cho-Jui
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2019, 133 : 30 - 39
  • [45] Exploring the Effects of Silent Data Corruption in Distributed Deep Learning Training
    Rojas, Elvis
    Perez, Diego
    Meneses, Esteban
    2022 IEEE 34TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2022), 2022, : 21 - 30
  • [46] BK.Synapse: A scalable distributed training framework for deep learning
    Dinh Viet Sang
    Phan Ngoc Lan
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 43 - 48
  • [47] Deployment Service for Scalable Distributed Deep Learning Training on Multiple Clouds
    Jorge, Javier
    Molto, German
    Segrelles, Damian
    Fontes, Joao Pedro
    Guevara, Miguel Angel
    CLOSER: PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2021, : 135 - 142
  • [48] Distributed deep learning training using silicon photonic switched architectures
    Zhu, Ziyi
    Teh, Min Yee
    Wu, Zhenguo
    Glick, Madeleine Strom
    Yan, Shijia
    Hattink, Maarten
    Bergman, Keren
    APL PHOTONICS, 2022, 7 (03)
  • [49] An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract)
    Mwase, Christine
    Kahira, Albert Njoroge
    Zou, Zhuo
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23590 - 23591
  • [50] Optimizing on-demand GPUs in the Cloud for Deep Learning Applications Training
    Jahani, Arezoo
    Lattuada, Marco
    Ciavotta, Michele
    Ardagna, Danilo
    Amaldi, Edoardo
    Zhang, Li
    2019 4TH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATIONS AND SECURITY (ICCCS), 2019,