Performance and Consistency Analysis for Distributed Deep Learning Applications

Cited by: 0
Authors
Jia, Danlin [1 ]
Saha, Manoj Pravakar [2 ]
Bhimani, Janki [2 ]
Mi, Ningfang [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Florida Int Univ, Miami, FL 33199 USA
Funding
National Science Foundation (USA)
DOI
10.1109/IPCCC50635.2020.9391566
Chinese Library Classification (CLC)
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
Accelerating the training of Deep Neural Network (DNN) models is essential for applying deep learning techniques successfully in fields such as computer vision and speech recognition. Distributed frameworks speed up training for large DNN models and datasets. A large body of work has improved model accuracy and training efficiency based on mathematical analysis of the computations in Convolutional Neural Networks (CNNs). However, to run distributed deep learning applications in the real world, users and developers also need to consider the impact of how system resources are distributed. In this work, we deploy a real distributed deep learning cluster with multiple virtual machines. We conduct an in-depth analysis of how system configurations, distribution topologies, and application parameters affect the latency and correctness of distributed deep learning applications. We analyze performance variation under different model consistency schemes and degrees of data parallelism by profiling run-time system utilization and tracking application activities. Based on our observations and analysis, we develop design guidelines for accelerating distributed deep learning training in virtualized environments.
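As a concrete illustration of the synchronous data-parallel setting the abstract refers to (each worker holds a full model replica, trains on its own data shard, and a gradient all-reduce keeps the replicas consistent after every step), the following minimal sketch uses PyTorch DistributedDataParallel. It is not taken from the paper; the toy model, random dataset, and hyperparameters are placeholder assumptions chosen only for illustration.

# Minimal sketch of synchronous data-parallel training (assumed setup, not the
# authors' code). Launch with: torchrun --nproc_per_node=N train.py
# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the environment.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    rank = dist.get_rank()

    # Toy data and model stand in for the CNN workloads studied in the paper.
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(data)        # data parallelism: a disjoint shard per worker
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    model = DDP(model)                        # gradient all-reduce keeps replicas consistent
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)              # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()   # backward() triggers the all-reduce
            opt.step()
        if rank == 0:
            print(f"epoch {epoch} done")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

In this synchronous scheme, all replicas see identical parameters after each step; asynchronous parameter-server variants relax that consistency, which is one of the trade-offs the paper's analysis examines.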
Pages: 8
Related Papers
50 entries in total
  • [31] Applications of deep learning for the analysis of medical data
    Jang, Hyun-Jong
    Cho, Kyung-Ok
    ARCHIVES OF PHARMACAL RESEARCH, 2019, 42 (06) : 492 - 504
  • [32] Deep learning applications for kidney histology analysis
    Pilva, Pourya
    Buelow, Roman
    Boor, Peter
    CURRENT OPINION IN NEPHROLOGY AND HYPERTENSION, 2024, 33 (03): : 291 - 297
  • [34] Consistency in models for communication constrained distributed learning
    Predd, JB
    Kulkarni, SR
    Poor, HV
    LEARNING THEORY, PROCEEDINGS, 2004, 3120 : 442 - 456
  • [35] Gradient Learning With the Mode-Induced Loss: Consistency Analysis and Applications
    Chen, Hong
    Fu, Youcheng
    Jiang, Xue
    Chen, Yanhong
    Li, Weifu
    Zhou, Yicong
    Zheng, Feng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (07) : 9686 - 9699
  • [36] No compromises: distributed transactions with consistency, availability, and performance
    Dragojevic, Aleksandar
    Narayanan, Dushyanth
    Nightingale, Edmund B.
    Renzelmann, Matthew
    Shamis, Alex
    Badam, Anirudh
    Castro, Miguel
    SOSP'15: PROCEEDINGS OF THE TWENTY-FIFTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES, 2015, : 54 - 70
  • [37] Empirical Performance Analysis of Collective Communication for Distributed Deep Learning in a Many-Core CPU Environment
    Woo, Junghoon
    Choi, Hyeonseong
    Lee, Jaehwan
    APPLIED SCIENCES-BASEL, 2020, 10 (19):
  • [38] Performance evaluation of the distributed object consistency protocol
    Perret, Stephane
    Dilley, John
    Arlitt, Martin
    HP Laboratories Technical Report, 1999, (108):
  • [39] Aksum: A performance analysis tool for parallel and distributed applications
    Fahringer, T
    Seragiotto, C
    PERFORMANCE ANALYSIS AND GRID COMPUTING, 2004, : 189 - 208
  • [40] Different approaches to automatic performance analysis of distributed applications
    Margalef, T
    Jorba, J
    Morajko, O
    Morajko, A
    Luque, E
    PERFORMANCE ANALYSIS AND GRID COMPUTING, 2004, : 3 - 19