SDPIPE: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

Cited by: 5
Authors
Miao, Xupeng [1 ]
Shi, Yining [2 ]
Yang, Zhi [2 ]
Cui, Bin [2 ]
Jia, Zhihao [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Peking Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / No. 09
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
ALGORITHMS;
DOI
10.14778/3598581.3598604
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
The increasing size of both deep learning models and training data necessitates the ability to scale out model training through pipeline-parallel training, which combines pipelined model parallelism and data parallelism. However, most existing approaches assume an ideal, homogeneous, dedicated cluster. On real cloud clusters, these approaches suffer from intensive model synchronization overheads caused by dynamic environment heterogeneity. This challenge leaves the design in a dilemma: either accept the performance bottleneck of a central parameter server (PS), or accept severe performance degradation caused by stragglers under decentralized synchronization (such as All-Reduce). This paper presents SDPIPE, a new semi-decentralized framework that gets the best of both worlds, achieving both high heterogeneity tolerance and convergence efficiency in pipeline-parallel training. To provide high performance, we decentralize the communication for model synchronization, which accounts for the largest proportion of synchronization overhead. In contrast, we centralize the process of group scheduling, which is lightweight but needs a global view for better performance and convergence speed under heterogeneity. We show via a prototype implementation the significant advantage of SDP... on performance and scalability across different environments.
Pages: 2354 - 2363
Page count: 10