Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors:
Jia, Chengfan [1 ]
Liu, Junnan [1 ]
Jin, Xu [1 ]
Lin, Han [1 ]
An, Hong [1 ]
Han, Wenting [1 ]
Wu, Zheng [1 ]
Chi, Mengxian [1 ]
Affiliations:
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords: Distributed TensorFlow; RDMA; InfiniBand; Optimization
DOI: 10.1007/s10766-017-0520-3
CLC number: TP301 [Theory and Methods]
Discipline code: 081202
Abstract:
TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device through a single API. A distributed TensorFlow job uses gRPC to communicate between nodes; when training tasks are deployed on high-performance computing clusters, however, gRPC becomes the bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow to RDMA verbs, we obtain nearly a 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training scale grows.
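To make the abstract's approach concrete, the sketch below illustrates the kind of verbs-level transfer it alludes to: a tensor buffer is registered with the NIC and pushed to a peer with a one-sided RDMA write using the standard libibverbs API. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the connected queue pair `qp`, protection domain `pd`, and the peer's `remote_addr`/`rkey` pair (assumed to have been exchanged out of band, e.g. over gRPC during setup) are placeholders, and `rdma_write_tensor` is a hypothetical helper name.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: push a tensor buffer to a peer with a one-sided
 * RDMA write. Assumes a connected queue pair `qp` and protection domain
 * `pd`; the peer's buffer address and rkey were exchanged beforehand. */
int rdma_write_tensor(struct ibv_pd *pd, struct ibv_qp *qp,
                      void *tensor_buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    /* Pin and register the tensor buffer so the NIC can DMA from it.
     * A real transport would cache and reuse this registration. */
    struct ibv_mr *mr = ibv_reg_mr(pd, tensor_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)tensor_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* One-sided write: the remote CPU does not participate in the
     * transfer, which is what removes the gRPC copy and
     * serialization overhead on the receive side. */
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        ibv_dereg_mr(mr);
        return -1;
    }
    /* The caller polls the completion queue (ibv_poll_cq) before
     * reusing or deregistering the buffer. */
    return 0;
}
```

A design note implied by this sketch: because registration pins memory and is expensive, an RDMA transport for TensorFlow would typically pre-register tensor buffers and notify the receiver of completed writes (e.g. via immediate data or a small send), rather than registering per transfer as shown here.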
Pages: 674-685 (12 pages)