Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors:
Jia, Chengfan [1 ]
Liu, Junnan [1 ]
Jin, Xu [1 ]
Lin, Han [1 ]
An, Hong [1 ]
Han, Wenting [1 ]
Wu, Zheng [1 ]
Chi, Mengxian [1 ]
Affiliations:
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords: Distributed TensorFlow; RDMA; InfiniBand; Optimization
DOI: 10.1007/s10766-017-0520-3
CLC number: TP301 [Theory and Methods]
Discipline code: 081202
Abstract:
TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device through a single API. A distributed TensorFlow job uses gRPC to communicate between nodes; when training tasks are deployed on high-performance computing clusters, however, gRPC becomes the bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow to RDMA verbs, we obtain nearly a 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training scale grows.
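To make the abstract's approach concrete, the sketch below illustrates the kind of verbs-level transfer it alludes to: a tensor buffer is registered with the NIC and pushed to a peer with a one-sided RDMA write using the standard libibverbs API. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the connected queue pair `qp`, protection domain `pd`, and the peer's `remote_addr`/`rkey` pair (assumed to have been exchanged out of band, e.g. over gRPC during setup) are placeholders, and `rdma_write_tensor` is a hypothetical helper name.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: push a tensor buffer to a peer with a one-sided
 * RDMA write. Assumes a connected queue pair `qp` and protection domain
 * `pd`; the peer's buffer address and rkey were exchanged beforehand. */
int rdma_write_tensor(struct ibv_pd *pd, struct ibv_qp *qp,
                      void *tensor_buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    /* Pin and register the tensor buffer so the NIC can DMA from it.
     * A real transport would cache and reuse this registration. */
    struct ibv_mr *mr = ibv_reg_mr(pd, tensor_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)tensor_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* One-sided write: the remote CPU does not participate in the
     * transfer, which is what removes the gRPC copy and
     * serialization overhead on the receive side. */
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        ibv_dereg_mr(mr);
        return -1;
    }
    /* The caller polls the completion queue (ibv_poll_cq) before
     * reusing or deregistering the buffer. */
    return 0;
}
```

A design note implied by this sketch: because registration pins memory and is expensive, an RDMA transport for TensorFlow would typically pre-register tensor buffers and notify the receiver of completed writes (e.g. via immediate data or a small send), rather than registering per transfer as shown here.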
Pages: 674-685 (12 pages)