Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors
Jia, Chengfan [1 ]
Liu, Junnan [1 ]
Jin, Xu [1 ]
Lin, Han [1 ]
An, Hong [1 ]
Han, Wenting [1 ]
Wu, Zheng [1 ]
Chi, Mengxian [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords
Distributed TensorFlow; RDMA; InfiniBand; Optimization
DOI
10.1007/s10766-017-0520-3
Chinese Library Classification (CLC)
TP301 [Theory and methods]
Subject classification code
081202
Abstract
TensorFlow is an open-source software library designed for deep learning using dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. In distributed mode, TensorFlow uses gRPC to communicate between nodes. However, when training tasks are deployed on high-performance computing clusters, the performance of gRPC becomes a bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but open-source TensorFlow has not taken advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive parts of TensorFlow to RDMA verbs, we obtain nearly 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also shows good scalability as the training scale grows.
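The abstract describes replacing the gRPC tensor transport with RDMA verbs. In mainstream TensorFlow 1.x a comparable RDMA path was exposed through the contrib "verbs" module and selected via the server protocol string; the sketch below shows how such a cluster might be launched under that assumption. The `grpc+verbs` protocol string, host names, and two-node layout are illustrative, not the authors' exact setup.

```python
# Minimal sketch: starting a distributed TensorFlow 1.x server that moves
# tensor transfers onto RDMA verbs instead of plain gRPC.
# Assumes a TensorFlow 1.x build with the contrib "verbs" module enabled;
# host names and ports are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],               # parameter server
    "worker": ["node1:2222", "node2:2222"], # training workers
})

# protocol="grpc+verbs" keeps gRPC for control messages but routes the
# tensor send/receive path over RDMA verbs (InfiniBand).
server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc+verbs")
server.join()
```

Each process in the cluster would be started the same way with its own `job_name` and `task_index`; only the transport protocol changes relative to a stock gRPC deployment.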
Pages: 674-685
Page count: 12