TCCL: Co-optimizing Collective Communication and Traffic Routing for GPU-centric Clusters

Cited by: 0
Authors
Li, Baojia [1]
Wang, Xiaoliang [2]
Wang, Jingzhu [1]
Liu, Yifan [2]
Gong, Yuanyuan [1]
Lu, Hao [1]
Dang, Weizhen [1]
Zhang, Weifeng [1]
Huang, Xiaojie [1]
Chen, Mingzhuo [1]
Chen, Jie [1]
He, Chunzhi [1]
Liu, Yadong [1]
Hu, Xiaoyuan [1]
Liu, Chen [1]
Ji, Xuefeng [1]
Xia, Yinben [1]
Li, Xiang [1]
He, Zekun [1]
Wang, Yachen [1]
Zou, Xianneng [1]
Affiliations
[1] Tencent, Shenzhen, Guangdong, People's Republic of China
[2] Nanjing University, Nanjing, Jiangsu, People's Republic of China
Keywords
Collective Communication; Load Balancing; Centralized Control and Management
DOI
10.1145/3672198.3673799
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
GPU-centric clusters are increasingly deployed to support AI services, including large language model (LLM) training. The networks in these clusters exhibit several new characteristics: the boundary of network operation has extended from the switch network to the GPU interconnection network; the communication pattern is regular and predictable; and training is easily affected by network jitter. These characteristics introduce new challenges and opportunities for network operators building high-performance, resilient networking systems. In this paper, we present TCCL, an operational practice for managing GPU-centric networks with over 10K heterogeneous GPU cards. We argue that GPU-centric networks require joint optimization of topology-aware collective communication at the host and centralized routing management in the multi-path network. By leveraging the characteristics of the GPU-centric network, TCCL fully utilizes the high bandwidth of both the GPU and switch networks in parallel, reduces the delay of collective communication by using short paths across nodes, and avoids network congestion caused by route conflicts through traffic planning.
Pages: 48-53
Page count: 6
Related papers
9 items in total
  • [1] GPU-centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM
    Potluri, S.
    Goswami, A.
    Rossetti, D.
    Newburn, C. J.
    Venkata, M. Gorentla
    Imam, N.
    2017 IEEE 24TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2017: 253-262
  • [2] Co-optimizing Latency and Energy for IoT services using HMP servers in Fog Clusters
    Shukla, Sambit
    Ghosal, Dipak
    Wu, Kesheng
    Sim, Alex
    Farrens, Matthew
    2019 FOURTH INTERNATIONAL CONFERENCE ON FOG AND MOBILE EDGE COMPUTING (FMEC), 2019: 121-128
  • [3] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
    Huang, Jiajun
    Di, Sheng
    Yu, Xiaodong
    Zhai, Yujia
    Liu, Jinyang
    Huang, Yafan
    Raffenetti, Ken
    Zhou, Hui
    Zhao, Kai
    Lu, Xiaoyi
    Chen, Zizhong
    Cappello, Franck
    Guo, Yanfei
    Thakur, Rajeev
    PROCEEDINGS OF THE 38TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ACM ICS 2024, 2024: 437-448
  • [4] POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters
    Huang, Jiajun
    Di, Sheng
    Yu, Xiaodong
    Zhai, Yujia
    Liu, Jinyang
    Huang, Yafan
    Raffenetti, Ken
    Zhou, Hui
    Zhao, Kai
    Chen, Zizhong
    Cappello, Franck
    Guo, Yanfei
    Thakur, Rajeev
    PROCEEDINGS OF THE 29TH ACM SIGPLAN ANNUAL SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, PPOPP 2024, 2024: 454-456
  • [5] Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc*
    Khorassani, Kawthar Shafie
    Chen, Chen-Chun
    Subramoni, Hari
    Panda, Dhabaleswar K.
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM, IPDPS, 2023: 646-656
  • [6] Co-Optimizing Performance and Memory Footprint Via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform
    Bateni, Soroush
    Wang, Zhendong
    Zhu, Yuankun
    Hu, Yang
    Liu, Cong
    2020 IEEE REAL-TIME AND EMBEDDED TECHNOLOGY AND APPLICATIONS SYMPOSIUM (RTAS 2020), 2020: 310-323
  • [7] Steiner Tree-Based Design of Communication Infrastructure With Co-Optimizing the PMU Placement for Economical Design of WAMS
    Patel, Chintan D.
    Tailor, Tarun Kumar
    Shukla, Sunil Kumar
    Shah, Samyak
    Jani, Swapnil N.
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
  • [8] An approach for economic design of wide area monitoring system by co-optimizing phasor measurement unit placement and associated communication infrastructure
    Patel, Chintan D.
    Tailor, Tarun K.
    Shah, Samyak S.
    Shrivastava, Shivam H.
    INTERNATIONAL TRANSACTIONS ON ELECTRICAL ENERGY SYSTEMS, 2021, 31 (08)
  • [9] Improving Performance and Power by Co-Optimizing Middle-of-Line Routing, Pin Pattern Generation, and Contact over Active Gates in Standard Cell Layout Synthesis
    Chung, Sehyeon
    Jeong, Jooyeon
    Kim, Taewhan
    2022 ACM/IEEE INTERNATIONAL SYMPOSIUM ON LOW POWER ELECTRONICS AND DESIGN, ISLPED 2022, 2022,