Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors:
Jia, Chengfan [1 ]
Liu, Junnan [1 ]
Jin, Xu [1 ]
Lin, Han [1 ]
An, Hong [1 ]
Han, Wenting [1 ]
Wu, Zheng [1 ]
Chi, Mengxian [1 ]
Affiliations:
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords:
Distributed TensorFlow; RDMA; InfiniBand; Optimization
DOI
10.1007/s10766-017-0520-3
CLC classification: TP301 [Theory and Methods]
Discipline code: 081202
Abstract
TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. Distributed TensorFlow uses gRPC for communication between nodes. However, when training tasks are deployed on high-performance computing clusters, gRPC becomes a performance bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to a traditional TCP/IP network, but open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive parts of TensorFlow to RDMA verbs, we achieve nearly 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training scale grows.
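The abstract describes replacing the gRPC tensor transport with RDMA verbs. As an illustration only (not the authors' code), the sketch below shows the libibverbs primitive such a transfer path would rest on: a one-sided RDMA write of a registered tensor buffer to a peer whose buffer address and remote key were exchanged during connection setup. The function name `post_tensor_write` and all parameters are hypothetical.

```c
/* Minimal sketch, not the paper's implementation: posting a one-sided
 * RDMA write with libibverbs. Assumes a connected queue pair `qp`, a
 * locally registered memory region `mr` covering `tensor_buf`, and the
 * peer's buffer address `remote_addr` / key `rkey` exchanged out of
 * band (e.g. over a TCP or gRPC control channel during setup). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_tensor_write(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *tensor_buf, size_t tensor_len,
                             uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)tensor_buf,  /* local registered buffer */
        .length = (uint32_t)tensor_len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;    /* one-sided: no remote CPU on the data path */
    wr.send_flags = IBV_SEND_SIGNALED;    /* request a completion on the send CQ */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.wr.rdma.remote_addr = remote_addr; /* peer buffer, learned at setup */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}
```

A real transport would additionally poll the completion queue and notify the receiver that the tensor has landed; a design like the paper's layers those mechanisms on top of primitives of this kind.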
Pages: 674 - 685
Number of pages: 12
Related Papers
50 items in total
  • [11] Improving Spark Performance with Zero-copy Buffer Management and RDMA
    Li, Hu
    Chen, Tianjia
    Xu, Wei
    2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2016
  • [12] RDMA Based Performance Optimization on Distributed Database Systems: A Case Study with GoldenX
    Tu, Yaofeng
    Han, Yinjun
    Jin, Hao
    Chen, Zhenghua
    Zhao, Yanchao
    WIRELESS ALGORITHMS, SYSTEMS, AND APPLICATIONS, WASA 2021, PT II, 2021, 12938 : 237 - 248
  • [13] Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms
    Malik, Abid
    Lu, Micheal
    Wang, Nathenial
    Lin, Yeiwei
    Yoo, Shinjae
    2018 NEW YORK SCIENTIFIC DATA SUMMIT (NYSDS), 2018
  • [14] TensorFlow Doing HPC: An Evaluation of TensorFlow Performance in HPC Applications
    Chien, Steven W. D.
    Markidis, Stefano
    Olshevsky, Vyacheslav
    Bulatov, Yaroslav
    Laure, Erwin
    Vetter, Jeffrey S.
    2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2019, : 509 - 518
  • [15] PRISM: Rethinking the RDMA Interface for Distributed Systems
    Burke, Matthew
    Dharanipragada, Sowmya
    Joyner, Shannon
    Szekeres, Adriana
    Nelson, Jacob
    Zhang, Irene
    Ports, Dan R. K.
    PROCEEDINGS OF THE 28TH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES, SOSP 2021, 2021, : 228 - 242
  • [16] Efficient Distributed Memory Management with RDMA and Caching
    Cai, Qingchao
    Guo, Wentian
    Zhang, Hao
    Agrawal, Divyakant
    Chen, Gang
    Ooi, Beng Chin
    Tan, Kian-Lee
    Teo, Yong Meng
    Wang, Sheng
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (11): : 1604 - 1617
  • [17] Fast Distributed Deep Learning over RDMA
    Xue, Jilong
    Miao, Youshan
    Chen, Cheng
    Wu, Ming
    Zhang, Lintao
    Zhou, Lidong
    PROCEEDINGS OF THE FOURTEENTH EUROSYS CONFERENCE 2019 (EUROSYS '19), 2019
  • [18] Performance Isolation Anomalies in RDMA
    Zhang, Yiwen
    Gu, Juncheng
    Lee, Youngmoon
    Chowdhury, Mosharaf
    Shin, Kang G.
    KBNETS '17: PROCEEDINGS OF THE 2017 WORKSHOP ON KERNEL-BYPASS NETWORKS, 2017, : 43 - 48
  • [19] Distributed Deep Reinforcement Learning using TensorFlow
    Rao, P. Ajay
    Kumar, Navaneesh B.
    Cadabam, Siddharth
    Praveena, T.
    2017 INTERNATIONAL CONFERENCE ON CURRENT TRENDS IN COMPUTER, ELECTRICAL, ELECTRONICS AND COMMUNICATION (CTCEEC), 2017, : 171 - 174
  • [20] Improving Query Performance in Distributed Database
    Boicea, Alexandru
    Radulescu, Florin
    Truica, Ciprian-Octavian
    Urse, Loredana
    CONTROL ENGINEERING AND APPLIED INFORMATICS, 2016, 18 (02): : 57 - 64