TCCL: Co-optimizing Collective Communication and Traffic Routing for GPU-centric Clusters

Cited by: 0

Authors:

Li, Baojia [1]

Wang, Xiaoliang [2]

Wang, Jingzhu [1]

Liu, Yifan [2]

Gong, Yuanyuan [1]

Lu, Hao [1]

Dang, Weizhen [1]

Zhang, Weifeng [1]

Huang, Xiaojie [1]

Chen, Mingzhuo [1]

Chen, Jie [1]

He, Chunzhi [1]

Liu, Yadong [1]

Hu, Xiaoyuan [1]

Liu, Chen [1]

Ji, Xuefeng [1]

Xia, Yinben [1]

Li, Xiang [1]

He, Zekun [1]

Wang, Yachen [1]

Zou, Xianneng [1]

Affiliations:

[1] Tencent, Shenzhen, Guangdong, People's Republic of China

[2] Nanjing University, Nanjing, Jiangsu, People's Republic of China

Source:

PROCEEDINGS OF THE 2024 SIGCOMM WORKSHOP ON NETWORKS FOR AI COMPUTING, NAIC 2024 | 2024

Keywords:

Collective Communication; Load Balancing; Centralized Control and Management

DOI:

10.1145/3672198.3673799

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

GPU-centric clusters are increasingly deployed to support AI services, including large language model (LLM) training. The networks in these clusters exhibit several new characteristics: the boundary of network operation has extended from the switch network to the GPU interconnection network; the communication pattern is regular and predictable; and training is easily affected by network jitter. These characteristics introduce new challenges and opportunities for network operators building high-performance, resilient networking systems. In this paper, we present TCCL, an operational practice for managing GPU-centric networks with over 10K heterogeneous GPU cards. We argue that GPU-centric networks require joint optimization of topology-aware collective communication at the host and centralized routing management in the multi-path network. By leveraging the characteristics of the GPU-centric network, TCCL fully utilizes the high bandwidth of both the GPU and switch networks in parallel, reduces the delay of collective communication by using short paths across nodes, and avoids network congestion caused by route conflicts through traffic planning.

Pages: 48-53

Page count: 6
