Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

Cited by: 1
Authors
Liu, Ting [1 ,2 ]
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,3 ]
Li, Zhenyu [1 ,3 ]
He, Guangxin [1 ,2 ]
Wu, Jiaoren [4 ]
Zhang, Shengzhuo [4 ]
Yang, Xingwu [4 ]
Tyson, Gareth [5 ,6 ]
Xie, Gaogang [2 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Purple Mt Labs, Nanjing, Peoples R China
[4] Kuaishou, Beijing, Peoples R China
[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[6] Queen Mary Univ London, London, England
[7] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
distributed deep learning; scaling performance; performance modeling; tensor fusion; COMMUNICATION;
DOI
10.1145/3485447.3511981
Chinese Library Classification (CLC)
TP3 [computing technology; computer technology]
Discipline classification code
0812
Abstract
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with other workers. This introduces communication overhead and degrades the scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models, given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and the tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF obtains a low estimation error (ranging from 0.5% to 8.4% for different models). Using OSF, we identify the factors that degrade the scaling performance, and propose solutions to effectively mitigate their impacts. Specifically, the proposed adaptive tensor fusion improves the scaling performance by 32.2%~150% compared to using a constant tensor fusion buffer size.
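The two effects named in the abstract (computation-communication overlap and tensor fusion) can be illustrated with a toy per-iteration time estimate. The Python sketch below is not the paper's OSF model; the parameter names (t_fwd, t_bwd, grad_bytes, bandwidth, latency, fusion_bytes) and the simple max-based overlap rule are assumptions chosen purely for illustration.

# Toy sketch (assumed, not the paper's OSF formula): estimate one iteration's
# wall time given computation times, per-layer gradient sizes, and a tensor
# fusion buffer that batches small gradients into larger messages.

def iteration_time(t_fwd, t_bwd, grad_bytes, bandwidth, latency, fusion_bytes):
    """Return an estimated iteration time in seconds (illustrative only)."""
    # Group per-layer gradient sizes into fused buffers of at most fusion_bytes,
    # so each fused buffer pays the per-message latency only once.
    buffers, current = [], 0
    for size in grad_bytes:
        if current and current + size > fusion_bytes:
            buffers.append(current)
            current = 0
        current += size
    if current:
        buffers.append(current)

    # Communication time: per-buffer latency plus bytes over link bandwidth.
    t_comm = sum(latency + b / bandwidth for b in buffers)

    # Assume communication overlaps only with backward computation; whatever
    # does not fit under t_bwd is exposed on the critical path.
    exposed_comm = max(0.0, t_comm - t_bwd)
    return t_fwd + t_bwd + exposed_comm


# Example: 50 gradient tensors of 4 MB each, a 10 GB/s link, 64 MB fusion buffer.
t = iteration_time(t_fwd=0.030, t_bwd=0.060,
                   grad_bytes=[4e6] * 50, bandwidth=10e9,
                   latency=50e-6, fusion_bytes=64e6)
print(f"estimated iteration time: {t * 1e3:.2f} ms")

With these toy numbers most of the communication is hidden under the backward pass; shrinking fusion_bytes raises the per-message latency cost, while growing it delays when the first buffer can start transferring. That trade-off is the kind of effect the paper's adaptive tensor fusion is designed to balance.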
Pages: 1764-1773
Page count: 10