PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation

Cited by: 0
Authors
Qiu, Yuhang [1 ,2 ]
Zhao, Gongming [1 ,2 ]
Xu, Hongli [1 ,2 ]
Huang, He [3 ]
Qiao, Chunming [4 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Univ Sci & Technol China, Suzhou Inst Adv Res, Suzhou 215123, Jiangsu, Peoples R China
[3] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215123, Jiangsu, Peoples R China
[4] Univ Buffalo, Dept Comp Sci & Engn, Buffalo, NY 14260 USA
Funding
National Science Foundation (USA);
Keywords
Task analysis; Servers; Routing; Training; Aggregates; Topology; Switches; In-network aggregation; distributed training; task placement; gradient routing;
DOI
10.1109/TNET.2024.3414853
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With the increase in both the model size and dataset size of distributed training (DT) tasks, communication between the workers and parameter servers (PSs) in a cluster has become a bottleneck. In-network aggregation (INA) enabled by programmable switches has been proposed as a promising solution to alleviate the communication bottleneck. However, existing works focused on in-network aggregation implementation based on simple DT placement and fixed routing policies, which may lead to a large communication overhead and inefficient use of resources (e.g., storage, computing power and bandwidth). In this paper, we propose PARING, the first-of-its-kind INA approach that jointly optimizes DT task placement and routing in order to reduce traffic volume and minimize communication time. We formulate the problem as a nonlinear multi-objective mixed-integer programming problem, and prove its NP-Hardness. Based on the concept of Steiner trees, an algorithm with bounded approximation factors is proposed for this problem. Large-scale simulations show that our algorithm can reduce communication time by up to 81.0% and traffic volume by up to 19.1% compared to the state-of-the-art algorithms.
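The traffic reduction that the abstract attributes to in-network aggregation can be illustrated with a small counting exercise. The sketch below is not PARING's algorithm: it assumes a hypothetical two-level topology, a shortest-path tree rooted at the PS, equal-sized gradients (one unit each), and switches that can aggregate everything they receive with no memory limit. All names (`adj`, `bfs_parents`, `traffic_volumes`) are made up for illustration.

```python
from collections import deque

# Hypothetical topology: one PS, a core switch s0, two edge switches, four workers.
adj = {
    "ps": ["s0"],
    "s0": ["ps", "s1", "s2"],
    "s1": ["s0", "w1", "w2"],
    "s2": ["s0", "w3", "w4"],
    "w1": ["s1"], "w2": ["s1"],
    "w3": ["s2"], "w4": ["s2"],
}

def bfs_parents(adj, root):
    """Build a shortest-path tree rooted at `root`; map each node to its parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent

def traffic_volumes(adj, workers, ps):
    """Return (volume without aggregation, volume with aggregation at every switch).

    Volume is counted in gradient units crossing links per iteration (assumption:
    an aggregated gradient has the same size as a single worker's gradient).
    """
    parent = bfs_parents(adj, ps)
    per_worker_hops = 0   # every gradient travels its full path unaggregated
    tree_links = set()    # with aggregation, each tree link carries one unit
    for w in workers:
        node = w
        while parent[node] is not None:
            per_worker_hops += 1
            tree_links.add((node, parent[node]))
            node = parent[node]
    return per_worker_hops, len(tree_links)

print(traffic_volumes(adj, ["w1", "w2", "w3", "w4"], "ps"))  # (12, 7)
```

In this toy case the aggregation tree carries 7 units instead of 12. Choosing where to place workers and the PS, and which tree to route over, changes both numbers, which is the kind of joint decision the abstract describes.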
Pages: 4317 - 4332
Page count: 16
Related papers
50 in total
  • [31] Preemptive Switch Memory Usage to Accelerate Training Jobs with Shared In-Network Aggregation
    Wang, Hao
    Qin, Yuxuan
    Lao, ChonLam
    Le, Yanfang
    Wu, Wenfei
    Chen, Kai
    2023 IEEE 31ST INTERNATIONAL CONFERENCE ON NETWORK PROTOCOLS, ICNP, 2023,
  • [32] Joint placement, routing and dimensioning at the network edge for energy minimization
    Elkael, Maxime
    Araldo, Andrea
    D'Oro, Salvatore
    Castel-Taleb, Hind
    Aba, Massinissa Ait
    Jouaber, Badii
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 941 - 946
  • [33] In-Network Distributed Algorithm for Energy Optimal Routing Based on Dual Decomposition of Linear Programming
    Trdlicka, Jiri
    Hanzalek, Zdenek
    IEEE TRANSACTIONS ON COMMUNICATIONS, 2012, 60 (06) : 1634 - 1645
  • [34] Joint Optimization of Task Mapping and Routing for Service Provisioning in Distributed Datacenters
    Huang, Huawei
    Zeng, Deze
    Guo, Song
    Yao, Hong
    2014 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2014, : 4196 - 4201
  • [35] ABRM: In-Network Aggregation Based Routing Protocol for Mobile Sensor Networks with Multiple Mobile Sinks
    Soliman, Maged S.
    Fahmy, Hossam M. A.
    Salem, Ashraf E.
    2013 IEEE 27TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS (AINA), 2013, : 340 - 347
  • [36] Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration
    Luo, Ziyue
    Bao, Yixin
    Wu, Chuan
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2022), 2022, : 890 - 899
  • [37] JASPER: Joint Optimization of Scaling, Placement, and Routing of Virtual Network Services
    Draexler, Sevil
    Karl, Holger
    Mann, Zoltan Adam
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2018, 15 (03) : 946 - 960
  • [38] Toward Efficient Distributed Algorithms for In-Network Binary Operator Tree Placement in Wireless Sensor Networks
    Lu, Zongqing
    Wen, Yonggang
    Fan, Rui
    Tan, Su-Lim
    Biswas, Jit
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2013, 31 (04) : 743 - 755
  • [39] Joint Optimization of Virtualized Network Function Placement and Routing Allocation for Operational Expenditure
    Shi Jiugen
    Zhang Jing
    Xu Hao
    Wang Ji
    Sun Li
    JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2019, 41 (04) : 973 - 979
  • [40] Next Generation Optical Network Architecture Featuring Distributed Aggregation, Network Processing and Information Routing
    Orphanoudakis, Theofanis G.
    Matrakidis, Chris
    Stavdas, Alexandros
    2014 EUROPEAN CONFERENCE ON NETWORKS AND COMMUNICATIONS (EUCNC), 2014,