Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors
Jia, Chengfan [1]
Liu, Junnan [1]
Jin, Xu [1]
Lin, Han [1]
An, Hong [1]
Han, Wenting [1]
Wu, Zheng [1]
Chi, Mengxian [1]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords
Distributed TensorFlow; RDMA; Infiniband; Optimization;
DOI
10.1007/s10766-017-0520-3
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline classification code
081202;
Abstract
TensorFlow is an open-source software library designed for Deep Learning using dataflow graph computation. Thanks to the flexible architecture of TensorFlow, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. In a distributed TensorFlow job, gRPC is used for communication between different nodes. However, when training tasks are deployed on high performance computing clusters, the performance of gRPC becomes a bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but the open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow onto RDMA verbs, we achieve nearly a 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training scale grows.
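The abstract describes keeping gRPC for coordination while moving tensor send/receive onto RDMA verbs. The sketch below shows how such a transport choice is exposed when bringing up a distributed TensorFlow 1.x server; the host names, port, and the "grpc+verbs" protocol string come from the open-source verbs module and a hypothetical cluster layout, and are assumptions rather than the authors' exact implementation.

```python
# Minimal sketch: choosing the tensor transport for a distributed TensorFlow 1.x job.
# Assumes a TensorFlow build that includes verbs (RDMA) support; names and ports are illustrative.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# "grpc" is the default TCP/IP transport; "grpc+verbs" keeps gRPC for control
# messages but moves tensor send/receive onto InfiniBand RDMA verbs.
protocol = "grpc+verbs"

server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol=protocol)
server.join()  # block and serve tensor send/receive requests from the other nodes
```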
Pages: 674-685
Number of pages: 12