PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation

Authors
Qiu, Yuhang [1 ,2 ]
Zhao, Gongming [1 ,2 ]
Xu, Hongli [1 ,2 ]
Huang, He [3 ]
Qiao, Chunming [4 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Univ Sci & Technol China, Suzhou Inst Adv Res, Suzhou 215123, Jiangsu, Peoples R China
[3] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215123, Jiangsu, Peoples R China
[4] Univ Buffalo, Dept Comp Sci & Engn, Buffalo, NY 14260 USA
Funding
National Science Foundation (USA);
Keywords
Task analysis; Servers; Routing; Training; Aggregates; Topology; Switches; In-network aggregation; distributed training; task placement; gradient routing;
DOI
10.1109/TNET.2024.3414853
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With the increase in both the model size and dataset size of distributed training (DT) tasks, communication between the workers and parameter servers (PSs) in a cluster has become a bottleneck. In-network aggregation (INA) enabled by programmable switches has been proposed as a promising solution to alleviate the communication bottleneck. However, existing works focused on in-network aggregation implementation based on simple DT placement and fixed routing policies, which may lead to a large communication overhead and inefficient use of resources (e.g., storage, computing power and bandwidth). In this paper, we propose PARING, the first-of-its-kind INA approach that jointly optimizes DT task placement and routing in order to reduce traffic volume and minimize communication time. We formulate the problem as a nonlinear multi-objective mixed-integer programming problem, and prove its NP-Hardness. Based on the concept of Steiner trees, an algorithm with bounded approximation factors is proposed for this problem. Large-scale simulations show that our algorithm can reduce communication time by up to 81.0% and traffic volume by up to 19.1% compared to the state-of-the-art algorithms.
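As a rough illustration of why in-network aggregation can reduce traffic volume, the sketch below counts link traversals on a toy tree topology with and without aggregation at the switches. It is not the PARING algorithm: the topology, the node names (`w1`, `s0`, `ps`, and so on), the unit-size gradients, and the assumption that every switch can aggregate without capacity limits are all hypothetical choices made for this example.

```python
# Toy tree topology: child -> parent, rooted at the parameter server "ps".
# All node names are hypothetical and not taken from the paper.
PARENT = {
    "w1": "s1", "w2": "s1", "w3": "s2",
    "s1": "s0", "s2": "s0", "s0": "ps",
}
WORKERS = ["w1", "w2", "w3"]


def path_links(node, parent):
    """Return the (child, parent) links on the path from node up to the root."""
    links = []
    while node in parent:
        links.append((node, parent[node]))
        node = parent[node]
    return links


def traffic_without_ina(workers, parent):
    """Each unit-size gradient crosses every link on its own path unmerged."""
    return sum(len(path_links(w, parent)) for w in workers)


def traffic_with_ina(workers, parent):
    """Assume each switch merges incoming gradients into one unit,
    so every link that is used carries exactly one unit."""
    used = set()
    for w in workers:
        used.update(path_links(w, parent))
    return len(used)


print(traffic_without_ina(WORKERS, PARENT), traffic_with_ina(WORKERS, PARENT))
```

In this toy model the aggregated traffic equals the number of links in the tree that connects the workers to the parameter server, which is one way to see why a Steiner-tree view of routing is natural here.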
Pages: 4317-4332
Number of pages: 16
Related papers
50 entries in total
  • [21] Opportunistic routing with in-network aggregation for duty-cycled WSNs with delay requirements
    So, Jungmin
    Byun, Heejung
    EURASIP JOURNAL ON WIRELESS COMMUNICATIONS AND NETWORKING, 2014,
  • [22] In-network Training and Distributed Event Detection in Wireless Sensor Networks
    Wittenburg, Georg
    Dziengel, Norman
    Schiller, Jochen
    SENSYS'08: PROCEEDINGS OF THE 6TH ACM CONFERENCE ON EMBEDDED NETWORKED SENSOR SYSTEMS, 2008, : 387 - 388
  • [23] DRINA: A Lightweight and Reliable Routing Approach for In-Network Aggregation in Wireless Sensor Networks
    Villas, Leandro Aparecido
    Boukerche, Azzedine
    Ramos, Heitor Soares
    Fernandes de Oliveira, Horacio A. B.
    de Araujo, Regina Borges
    Ferreira Loureiro, Antonio Alfredo
    IEEE TRANSACTIONS ON COMPUTERS, 2013, 62 (04) : 676 - 689
  • [24] Orchestrating In-Network Aggregation for Distributed Machine Learning via In-Band Network Telemetry
    Ji, Ming-Tao
    Jin, Yi-Bo
    Qian, Zhu-Zhong
    Cao, Tuo
    Ye, Bao-Liu
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2025, 40 (01) : 196 - 214
  • [25] Q-learning based routing for in-network aggregation in wireless sensor networks
    Maivizhi, Radhakrishnan
    Yogesh, Palanichamy
    WIRELESS NETWORKS, 2021, 27 (03) : 2231 - 2250
  • [26] Joint Network Slicing, Routing, and In-Network Computing for Energy-Efficient 6G
    Sasan, Zeinab
    Shokrnezhad, Masoud
    Khorsandi, Siavash
    Taleb, Tarik
    2024 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE, WCNC 2024, 2024,
  • [27] Opportunistic routing with in-network aggregation for asynchronous duty-cycled wireless sensor networks
    So, Jungmin
    Byun, Heejung
    WIRELESS NETWORKS, 2014, 20 (05) : 833 - 846
  • [28] Joint task placement, routing and power control for low power mobile grid computing in ad hoc network
    Li, M
    Wu, XB
    Zhao, ML
    Wang, H
    Li, P
    Yan, XL
    GRID AND COOPERATIVE COMPUTING GCC 2004, PROCEEDINGS, 2004, 3251 : 591 - 600
  • [29] Opportunistic routing with in-network aggregation for asynchronous duty-cycled wireless sensor networks
    Jungmin So
    Heejung Byun
    Wireless Networks, 2014, 20 : 833 - 846
  • [30] Joint Placement and Routing of Network Function Chains in Data Centers
    Guo, Linqi
    Pang, John
    Walid, Anwar
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2018), 2018, : 612 - 620