Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors
Jia, Chengfan [1 ]
Liu, Junnan [1 ]
Jin, Xu [1 ]
Lin, Han [1 ]
An, Hong [1 ]
Han, Wenting [1 ]
Wu, Zheng [1 ]
Chi, Mengxian [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords
Distributed TensorFlow; RDMA; Infiniband; Optimization;
DOI
10.1007/s10766-017-0520-3
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device through a single API. In a distributed setting, TensorFlow uses gRPC to communicate between nodes. However, when training tasks are deployed on high-performance computing clusters, gRPC becomes the bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but open-source TensorFlow does not exploit this advantage. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive parts of TensorFlow to RDMA verbs, we achieve nearly 6x performance improvement over the original gRPC-based distributed TensorFlow. The RDMA-enabled TensorFlow system also scales well as the training scale grows.
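Note: as a rough illustration of how an RDMA/verbs transport is typically selected in distributed TensorFlow 1.x deployments, the sketch below starts a server with protocol="grpc+verbs", the switch exposed by the upstream contrib/verbs module; the host names, ports, and job layout are placeholders, and this is not necessarily the exact interface of the implementation described in the paper.

    import tensorflow as tf  # TensorFlow 1.x API

    # Placeholder cluster description; a real deployment lists its own nodes.
    cluster = tf.train.ClusterSpec({
        "ps":     ["node0:2222"],
        "worker": ["node1:2222", "node2:2222"],
    })

    # With a verbs-enabled build, "grpc+verbs" keeps gRPC for control and
    # administrative messages while tensor payloads are transferred over RDMA.
    server = tf.train.Server(cluster, job_name="ps", task_index=0,
                             protocol="grpc+verbs")
    server.join()  # a parameter server blocks here; workers build and run the graph instead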
Pages: 674 - 685
Number of pages: 12
Related Papers
50 records in total
  • [31] Improving the performance of distributed virtual environment systems
    Morillo, P
    Orduña, JM
    Fernández, M
    Duato, J
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2005, 16 (07) : 637 - 649
  • [32] Improving Performance in Component Based Distributed Systems
    Al-Wesabi, Fahd N.
    Iskandar, Huda G.
    Ghilan, Mokhtar M.
    EAI ENDORSED TRANSACTIONS ON SCALABLE INFORMATION SYSTEMS, 2019, 6 (22): 1 - 7
  • [33] Improving Dependability of Onboard Deep Learning with Resilient TensorFlow
    Garrett, Tyler
    George, Alan D.
    2021 IEEE SPACE COMPUTING CONFERENCE (SCC), 2021, : 134 - 142
  • [34] Improving performance in distributed embodied evolution: Distributed Differential Embodied Evolution
    Trueba, Pedro
    Prieto, Abraham
    2018 CONFERENCE ON ARTIFICIAL LIFE (ALIFE 2018), 2018, : 222 - 223
  • [35] A Distributed Persistent Memory File System Based on RDMA Multicast
    Chen M.
    Zheng S.
    You L.
    Wang J.
    Yan T.
    Tu Y.
    Han Y.
    Huang L.
    Zheng, Sheng'an (venero@tsinghua.edu.cn), 1600, Science Press (58): 384 - 396
  • [36] RDMA-based Cooperative Caching for a Distributed File System
    Sasaki, Shin
    Matsumiya, Ryo
    Takahashi, Kazushi
    Oyama, Yoshihiro
    Tatebe, Osamu
    2015 IEEE 21ST INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2015, : 344 - 353
  • [37] iRDMA: Efficient Use of RDMA in Distributed Deep Learning Systems
    Ren, Yufei
    Wu, Xingbo
    Zhang, Li
    Wang, Yandong
    Zhang, Wei
    Wang, Zijun
    Hack, Michel
    Jiang, Song
    2017 19TH IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS (HPCC) / 2017 15TH IEEE INTERNATIONAL CONFERENCE ON SMART CITY (SMARTCITY) / 2017 3RD IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (DSS), 2017, : 231 - 238
  • [38] FastStore: A High-Performance RDMA-enabled Distributed Key-Value Store with Persistent Memory
    Xiong, Ziwei
    Jiang, Dejun
    Xiong, Jin
    2023 IEEE 43RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS, 2023, : 406 - 417
  • [39] Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!
    Wei, Xingda
    Dong, Zhiyuan
    Chen, Rong
    Chen, Haibo
    PROCEEDINGS OF THE 13TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, 2018, : 233 - 251
  • [40] RDMA over Ethernet for Distributed AI Training at Meta Scale
    Gangidi, Adithya
    Miao, Rui
    Zheng, Shengbao
    Bondu, Sai Jayesh
    Goes, Guilherme
    Morsy, Hany
    Puri, Rohit
    Riftadi, Mohammad
    Shetty, Ashmitha Jeevaraj
    Yang, Jingyi
    Zhang, Shuqiang
    Fernandez, Mikel Jimenez
    Gandham, Shashidhar
    Zeng, Hongyi
    PROCEEDINGS OF THE 2024 ACM SIGCOMM 2024 CONFERENCE, ACM SIGCOMM 2024, 2024, : 57 - 70