Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors
Jia, Chengfan [1]
Liu, Junnan [1]
Jin, Xu [1]
Lin, Han [1]
An, Hong [1]
Han, Wenting [1]
Wu, Zheng [1]
Chi, Mengxian [1]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords
Distributed TensorFlow; RDMA; Infiniband; Optimization;
DOI
10.1007/s10766-017-0520-3
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline classification code
081202;
Abstract
TensorFlow is an open-source software library designed for Deep Learning using dataflow graph computation. Thanks to the flexible architecture of TensorFlow, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. In a distributed TensorFlow job, gRPC is used for communication between different nodes. However, when training tasks are deployed on high performance computing clusters, the performance of gRPC becomes a bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but the open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow onto RDMA verbs, we achieve nearly a 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training scale grows.
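The abstract describes keeping gRPC for coordination while moving tensor send/receive onto RDMA verbs. The sketch below shows how such a transport choice is exposed when bringing up a distributed TensorFlow 1.x server; the host names, port, and the "grpc+verbs" protocol string come from the open-source verbs module and a hypothetical cluster layout, and are assumptions rather than the authors' exact implementation.

```python
# Minimal sketch: choosing the tensor transport for a distributed TensorFlow 1.x job.
# Assumes a TensorFlow build that includes verbs (RDMA) support; names and ports are illustrative.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# "grpc" is the default TCP/IP transport; "grpc+verbs" keeps gRPC for control
# messages but moves tensor send/receive onto InfiniBand RDMA verbs.
protocol = "grpc+verbs"

server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol=protocol)
server.join()  # block and serve tensor send/receive requests from the other nodes
```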
Pages: 674-685
Number of pages: 12