Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

Cited by: 1
Authors
Liu, Ting [1 ,2 ]
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,3 ]
Li, Zhenyu [1 ,3 ]
He, Guangxin [1 ,2 ]
Wu, Jiaoren [4 ]
Zhang, Shengzhuo [4 ]
Yang, Xingwu [4 ]
Tyson, Gareth [5 ,6 ]
Xie, Gaogang [2 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Purple Mt Labs, Nanjing, Peoples R China
[4] Kuaishou, Beijing, Peoples R China
[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[6] Queen Mary Univ London, London, England
[7] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
distributed deep learning; scaling performance; performance modeling; tensor fusion; communication;
DOI
10.1145/3485447.3511981
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with the other workers, which introduces communication overhead and degrades scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF achieves a low estimation error, ranging from 0.5% to 8.4% across different models. Using OSF, we identify the factors that degrade scaling performance and propose solutions to effectively mitigate their impact. In particular, the proposed adaptive tensor fusion improves scaling performance by 32.2%~150% compared to using a constant tensor fusion buffer size.
Pages: 1764-1773
Page count: 10
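
The abstract describes OSF only at a high level. As a hedged illustration of the two mechanisms it names (overlap of communication with computation, and tensor fusion for batching gradient updates), the sketch below estimates per-iteration time with byte-threshold fusion buffers and a ring all-reduce cost model. This is not the paper's OSF formulation: the cost formulas, the fusion policy, and every parameter value are assumptions made purely for illustration.

```python
# Minimal sketch (NOT the paper's OSF model): a back-of-the-envelope
# per-iteration time estimate capturing the two effects the abstract names --
# overlap of communication with backprop, and tensor fusion. The ring
# all-reduce cost model, the byte-threshold fusion policy, and all parameter
# values below are illustrative assumptions.

def allreduce_time(nbytes, workers, bandwidth, latency):
    """Ring all-reduce: 2*(n-1) steps, each paying a latency term, and
    2*(n-1)/n of the payload crossing each link (bandwidth in bytes/s)."""
    steps = 2 * (workers - 1)
    return steps * latency + (2 * (workers - 1) / workers) * nbytes / bandwidth

def iteration_time(layer_grad_bytes, layer_backprop_s, workers,
                   bandwidth, latency, fusion_buffer_bytes):
    """Gradients produced layer by layer during backprop are batched into
    fusion buffers; each buffer's all-reduce overlaps with the remaining
    backprop computation, but buffers are reduced one at a time."""
    compute_done = 0.0   # time at which backprop has produced the current buffer
    comm_done = 0.0      # time at which the previous all-reduce finished
    buffered = 0.0
    for grad_bytes, bp_s in zip(layer_grad_bytes, layer_backprop_s):
        compute_done += bp_s
        buffered += grad_bytes
        if buffered >= fusion_buffer_bytes:
            start = max(compute_done, comm_done)   # wait for data and the link
            comm_done = start + allreduce_time(buffered, workers, bandwidth, latency)
            buffered = 0.0
    if buffered:                                   # flush the last partial buffer
        start = max(compute_done, comm_done)
        comm_done = start + allreduce_time(buffered, workers, bandwidth, latency)
    return max(compute_done, comm_done)

if __name__ == "__main__":
    grads = [8e6] * 50        # 50 layers, 8 MB of gradients each
    backprop = [2e-3] * 50    # 2 ms of backprop per layer
    t_ideal = sum(backprop)   # per-worker compute time with no communication
    t_ddl = iteration_time(grads, backprop, workers=8,
                           bandwidth=10e9, latency=20e-6,
                           fusion_buffer_bytes=64e6)
    # One illustrative definition of a scaling factor: the fraction of the
    # ideal (communication-free) speed that the distributed iteration achieves.
    print(f"estimated scaling factor ~ {t_ideal / t_ddl:.2f}")
```

With these made-up numbers most of the communication hides behind backprop and only the final buffers stay exposed, which is exactly where the choice of the (here hypothetical) fusion_buffer_bytes threshold matters: larger buffers amortize per-message latency, smaller ones shrink the un-overlapped tail.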