PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation

Authors
Qiu, Yuhang [1 ,2 ]
Zhao, Gongming [1 ,2 ]
Xu, Hongli [1 ,2 ]
Huang, He [3 ]
Qiao, Chunming [4 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Univ Sci & Technol China, Suzhou Inst Adv Res, Suzhou 215123, Jiangsu, Peoples R China
[3] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215123, Jiangsu, Peoples R China
[4] Univ Buffalo, Dept Comp Sci & Engn, Buffalo, NY 16260 USA
Funding
US National Science Foundation (NSF);
Keywords
Task analysis; Servers; Routing; Training; Aggregates; Topology; Switches; In-network aggregation; distributed training; task placement; gradient routing;
DOI
10.1109/TNET.2024.3414853
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With the increase in both the model size and dataset size of distributed training (DT) tasks, communication between the workers and parameter servers (PSs) in a cluster has become a bottleneck. In-network aggregation (INA) enabled by programmable switches has been proposed as a promising solution to alleviate the communication bottleneck. However, existing works focused on in-network aggregation implementation based on simple DT placement and fixed routing policies, which may lead to a large communication overhead and inefficient use of resources (e.g., storage, computing power and bandwidth). In this paper, we propose PARING, the first-of-its-kind INA approach that jointly optimizes DT task placement and routing in order to reduce traffic volume and minimize communication time. We formulate the problem as a nonlinear multi-objective mixed-integer programming problem, and prove its NP-Hardness. Based on the concept of Steiner trees, an algorithm with bounded approximation factors is proposed for this problem. Large-scale simulations show that our algorithm can reduce communication time by up to 81.0% and traffic volume by up to 19.1% compared to the state-of-the-art algorithms.
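The traffic-volume claim in the abstract can be illustrated with a toy calculation. The sketch below is not the PARING algorithm; it only compares, on a hypothetical tree topology, how many link traversals one round of gradient upload needs when every worker sends its gradient all the way to the PS versus when each switch aggregates incoming gradients and forwards a single result. The topology, the node names, and the assumption that every switch can aggregate without resource limits are illustrative only (the abstract stresses that switch storage, computing power, and bandwidth are in fact constrained).

```python
# Toy comparison of traffic volume with and without in-network aggregation (INA).
# Assumptions (not from the paper): a tree rooted at the parameter server "ps",
# unit-size gradients, and unlimited aggregation capacity at every switch.

def path_to_root(node, parent):
    """Return the links (child, parent) traversed from `node` up to the root."""
    links = []
    while parent[node] is not None:
        links.append((node, parent[node]))
        node = parent[node]
    return links


def traffic_without_ina(workers, parent):
    """Each worker's gradient crosses every link on its own path to the PS."""
    return sum(len(path_to_root(w, parent)) for w in workers)


def traffic_with_ina(workers, parent):
    """Switches merge gradients, so each used link carries one gradient."""
    used_links = set()
    for w in workers:
        used_links.update(path_to_root(w, parent))
    return len(used_links)


# Hypothetical topology: ps <- s0 <- {s1, s2}; s1 <- {w1, w2}; s2 <- {w3, w4}.
parent = {
    "ps": None,
    "s0": "ps",
    "s1": "s0", "s2": "s0",
    "w1": "s1", "w2": "s1",
    "w3": "s2", "w4": "s2",
}
workers = ["w1", "w2", "w3", "w4"]

print(traffic_without_ina(workers, parent))  # 12 link traversals
print(traffic_with_ina(workers, parent))     # 7 link traversals
```

In this idealized setting the aggregated traffic equals the number of links in the tree that spans the workers and the PS, which is one way to see why a Steiner-tree view of the routing problem is natural. Where workers are placed changes which tree is available, so placement and routing interact.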
Pages: 4317-4332
Page count: 16