Improving the Performance of Distributed MXNet with RDMA

Cited by: 0
Authors
Mingfan Li
Ke Wen
Han Lin
Xu Jin
Zheng Wu
Hong An
Mengxian Chi
Affiliations
[1] University of Science and Technology of China
Keywords
Distributed MXNet; Parameter server; RDMA; InfiniBand; Network optimization
DOI
Not available
Abstract
As one of the most influential deep learning frameworks, MXNet has achieved excellent performance and many breakthroughs in academic and industrial fields across a variety of machine learning scenarios. The initial implementation of MXNet uses a proxy-socket interface, which delivers suboptimal performance in distributed environments. In a massively parallel training task, parameters are updated frequently during each training loop, so network performance becomes the dominant factor in overall performance. Over the past decade, high-performance interconnects have employed remote direct memory access (RDMA) technology to provide excellent performance for numerous scientific domains. In this paper, we describe an efficient design that extends the open-source MXNet to make it RDMA-capable via RDMA-based parameter server interfaces. With modest optimizations of memory usage and transmission overhead, RDMA-based MXNet achieves a substantial performance improvement over the original software. Our experiments reveal that, for the communication subsystem of MXNet, the new design achieves a 16x speedup (up to 21x at peak) over 1 Gigabit Ethernet (1GigE). For two training cases on MXNet, the optimized implementation gains 5x and 9x speedups, respectively. Compared to experiments on the IP-over-InfiniBand (IPoIB) protocol, it achieves nearly 30% performance improvement, as well as better scalability and alleviation of bottlenecks.
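The push/pull cycle of the parameter-server architecture mentioned in the abstract is the communication hot path the paper optimizes: each iteration, workers push gradients to the server and pull fresh weights back. The sketch below models that cycle in-process as a minimal illustration; it is not the paper's RDMA implementation, and all class and parameter names here are hypothetical.

```python
# Minimal in-process sketch of the parameter-server push/pull cycle that
# dominates communication in distributed MXNet training. In the real system
# each push/pull crosses the network (sockets, IPoIB, or RDMA) -- that
# transfer is exactly the path the paper accelerates with RDMA.

class ParameterServer:
    """Holds the shared weights; workers push gradients and pull updates."""

    def __init__(self, weights, lr=0.1):
        self.weights = dict(weights)  # parameter name -> value
        self.lr = lr                  # SGD learning rate

    def push(self, grads):
        # Apply a worker's gradients to the shared weights (plain SGD step).
        for name, g in grads.items():
            self.weights[name] -= self.lr * g

    def pull(self):
        # Workers fetch the freshest weights before the next iteration.
        return dict(self.weights)


# One training iteration with two workers sharing a single scalar weight.
ps = ParameterServer({"w": 1.0}, lr=0.1)
for worker_grad in ({"w": 0.5}, {"w": 0.3}):
    ps.push(worker_grad)          # frequent small transfers per loop
weights = ps.pull()
print(weights["w"])               # 1.0 - 0.1*0.5 - 0.1*0.3 ≈ 0.92
```

Because every iteration repeats this exchange for every parameter block, reducing per-transfer latency (as RDMA does) directly shortens the training loop.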
Pages: 467-480
Page count: 13
Related Papers
50 total
  • [1] Li, Mingfan; Wen, Ke; Lin, Han; Jin, Xu; Wu, Zheng; An, Hong; Chi, Mengxian. Improving the Performance of Distributed MXNet with RDMA. International Journal of Parallel Programming, 2019, 47(3): 467-480.
  • [2] Jia, Chengfan; Liu, Junnan; Jin, Xu; Lin, Han; An, Hong; Han, Wenting; Wu, Zheng; Chi, Mengxian. Improving the Performance of Distributed TensorFlow with RDMA. International Journal of Parallel Programming, 2018, 46(4): 674-685.
  • [3] Lv, Baocai; Liu, Bing; Liu, Fang; Xiao, Nong; Chen, Zhiguang. RM-KVStore: New MXNet KVStore to Accelerate Transfer Performance with RDMA. 2018 IEEE Symposium on Computers and Communications (ISCC), 2018: 241-247.
  • [4] Huang, Chengyuan; Gao, Yixiao; Chen, Wei; Li, Duoxing; Xiao, Yibo; Zhang, Ruyi; Tian, Chen; Wang, Xiaoliang; Dou, Wanchun; Chen, Guihai; Wang, Yi; Xiao, Fu. MC-RDMA: Improving Replication Performance of RDMA-based Distributed Systems with Reliable Multicast Support. 2023 IEEE 31st International Conference on Network Protocols (ICNP), 2023.
  • [5] Guney, Isa Ahmet; Ovant, Burak Sezin; Baydere, Sebnem. Impact of RDMA Communication on the Performance of Distributed BFS Algorithm. 2016 International Conference on High Performance Computing & Simulation (HPCS 2016), 2016: 350-356.
  • [6] Ding, Baorong; Han, Mingcong; Chen, Rong. DArray: A High Performance RDMA-Based Distributed Array. Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023), 2023: 715-724.
  • [7] Wang, Ziqi; Liu, Yaping; Zhang, Shuo; Hu, Jinrui; Liu, Xinyi. A Survey of RDMA Distributed Storage. 2024 5th International Conference on Computing, Networks and Internet of Things (CNIOT 2024), 2024: 534-539.
  • [8] Gu, Zheng; Small, Matthew; Yuan, Xin; Marathe, Aniruddha; Lowenthal, David K. Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters. International Journal of Parallel Programming, 2013, 41(5): 682-703.