Improving the Performance of Distributed TensorFlow with RDMA

Cited by: 23
Authors
Jia, Chengfan [1 ]
Liu, Junnan [1 ]
Jin, Xu [1 ]
Lin, Han [1 ]
An, Hong [1 ]
Han, Wenting [1 ]
Wu, Zheng [1 ]
Chi, Mengxian [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
Keywords
Distributed TensorFlow; RDMA; InfiniBand; Optimization
DOI
10.1007/s10766-017-0520-3
Chinese Library Classification (CLC)
TP301 [Theory and methods]
Subject classification code
081202
Abstract
TensorFlow is an open-source software library designed for deep learning using dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. In distributed mode, TensorFlow uses gRPC to communicate between nodes. However, when training tasks are deployed on high-performance computing clusters, the performance of gRPC becomes a bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but open-source TensorFlow has not taken advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive parts of TensorFlow to RDMA verbs, we obtain nearly 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also shows good scalability as the training scale grows.
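The abstract describes replacing the gRPC tensor transport with RDMA verbs. In mainstream TensorFlow 1.x a comparable RDMA path was exposed through the contrib "verbs" module and selected via the server protocol string; the sketch below shows how such a cluster might be launched under that assumption. The `grpc+verbs` protocol string, host names, and two-node layout are illustrative, not the authors' exact setup.

```python
# Minimal sketch: starting a distributed TensorFlow 1.x server that moves
# tensor transfers onto RDMA verbs instead of plain gRPC.
# Assumes a TensorFlow 1.x build with the contrib "verbs" module enabled;
# host names and ports are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],               # parameter server
    "worker": ["node1:2222", "node2:2222"], # training workers
})

# protocol="grpc+verbs" keeps gRPC for control messages but routes the
# tensor send/receive path over RDMA verbs (InfiniBand).
server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc+verbs")
server.join()
```

Each process in the cluster would be started the same way with its own `job_name` and `task_index`; only the transport protocol changes relative to a stock gRPC deployment.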
Pages: 674-685
Page count: 12