Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

Cited by: 1
Authors
Liu, Ting [1 ,2 ]
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,3 ]
Li, Zhenyu [1 ,3 ]
He, Guangxin [1 ,2 ]
Wu, Jiaoren [4 ]
Zhang, Shengzhuo [4 ]
Yang, Xingwu [4 ]
Tyson, Gareth [5 ,6 ]
Xie, Gaogang [2 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Purple Mt Labs, Nanjing, Peoples R China
[4] Kuaishou, Beijing, Peoples R China
[5] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[6] Queen Mary Univ London, London, England
[7] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
distributed deep learning; scaling performance; performance modeling; tensor fusion; COMMUNICATION;
DOI
10.1145/3485447.3511981
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with other workers. This introduces communication overhead and degrades the scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models, given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and the tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF obtains a low estimation error (ranging from 0.5% to 8.4% for different models). Using OSF, we identify the factors that degrade the scaling performance and propose solutions to effectively mitigate their impacts. Specifically, the proposed adaptive tensor fusion improves the scaling performance by 32.2%~150% compared to a constant tensor fusion buffer size.
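To make the abstract's idea concrete, the following minimal Python sketch estimates a scaling factor from a simplified per-iteration cost model with computation/communication overlap and tensor fusion. It is not the paper's recursive OSF formulation; all names, parameters (e.g., overlap_ratio, fusion_buffer_bytes), and the example numbers are hypothetical assumptions for illustration only.

    # Illustrative sketch only: a simplified scaling-factor estimate with
    # computation/communication overlap and tensor fusion. Not the OSF model.

    def estimate_scaling_factor(t_compute, tensor_sizes, bandwidth,
                                fusion_buffer_bytes, overlap_ratio=0.7):
        """Return an estimated scaling factor in (0, 1].

        t_compute           -- forward+backward computation time per iteration (s)
        tensor_sizes        -- gradient tensor sizes in bytes, in backprop order
        bandwidth           -- effective all-reduce bandwidth per worker (bytes/s)
        fusion_buffer_bytes -- tensor fusion buffer size (bytes)
        overlap_ratio       -- assumed fraction of communication hidden by compute
        """
        # Batch gradients into fusion buffers, mimicking tensor fusion.
        buffers, current = [], 0
        for size in tensor_sizes:
            current += size
            if current >= fusion_buffer_bytes:
                buffers.append(current)
                current = 0
        if current:
            buffers.append(current)

        # Total communication time for all fused buffers.
        t_comm = sum(buffers) / bandwidth

        # Only the non-overlapped part of communication extends the iteration.
        t_exposed = max(0.0, t_comm * (1.0 - overlap_ratio))
        t_iter = t_compute + t_exposed

        # 1.0 means perfect scaling (communication fully hidden).
        return t_compute / t_iter

    if __name__ == "__main__":
        # Hypothetical example: 100 gradient tensors of 4 MB each,
        # 25 GB/s effective bandwidth, 64 MB fusion buffer.
        sizes = [4 * 2**20] * 100
        print(estimate_scaling_factor(t_compute=0.25, tensor_sizes=sizes,
                                      bandwidth=25e9,
                                      fusion_buffer_bytes=64 * 2**20))

Under this toy model, enlarging the fusion buffer reduces the number of all-reduce launches but delays when communication can start, which is the trade-off the paper's adaptive tensor fusion targets.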
Pages: 1764 - 1773
Number of pages: 10
Related Papers
50 in total
  • [41] Distributed Framework for Accelerating Training of Deep Learning Models through Prioritization
    Zhou, Tian
    Gao, Lixin
    2021 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E 2021, 2021, : 201 - 209
  • [42] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation
    Pan, Rui
    Lei, Yiming
    Li, Jialong
    Xie, Zhiqiang
    Yuan, Binhang
    Xia, Yiting
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 93 - 100
  • [43] Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
    Teng, Yunfei
    Gao, Wenbo
    Chalus, Francois
    Choromanska, Anna
    Goldfarb, Donald
    Weller, Adrian
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [44] RedSync: Reducing synchronization bandwidth for distributed deep learning training system
    Fang, Jiarui
    Fu, Haohuan
    Yang, Guangwen
    Hsieh, Cho-Jui
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2019, 133 : 30 - 39
  • [45] Exploring the Effects of Silent Data Corruption in Distributed Deep Learning Training
    Rojas, Elvis
    Perez, Diego
    Meneses, Esteban
    2022 IEEE 34TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2022), 2022, : 21 - 30
  • [46] BK.Synapse: A scalable distributed training framework for deep learning
    Dinh Viet Sang
    Phan Ngoc Lan
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 43 - 48
  • [47] Deployment Service for Scalable Distributed Deep Learning Training on Multiple Clouds
    Jorge, Javier
    Molto, German
    Segrelles, Damian
    Fontes, Joao Pedro
    Guevara, Miguel Angel
    CLOSER: PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2021, : 135 - 142
  • [48] Distributed deep learning training using silicon photonic switched architectures
    Zhu, Ziyi
    Teh, Min Yee
    Wu, Zhenguo
    Glick, Madeleine Strom
    Yan, Shijia
    Hattink, Maarten
    Bergman, Keren
    APL PHOTONICS, 2022, 7 (03)
  • [49] An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract)
    Mwase, Christine
    Kahira, Albert Njoroge
    Zou, Zhuo
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23590 - 23591
  • [50] Optimizing on-demand GPUs in the Cloud for Deep Learning Applications Training
    Jahani, Arezoo
    Lattuada, Marco
    Ciavotta, Michele
    Ardagna, Danilo
    Amaldi, Edoardo
    Zhang, Li
    2019 4TH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATIONS AND SECURITY (ICCCS), 2019,