Performance and Consistency Analysis for Distributed Deep Learning Applications

Cited by: 0
Authors
Jia, Danlin [1 ]
Saha, Manoj Pravakar [2 ]
Bhimani, Janki [2 ]
Mi, Ningfang [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Florida Int Univ, Miami, FL 33199 USA
Funding
National Science Foundation (USA)
DOI
10.1109/IPCCC50635.2020.9391566
Chinese Library Classification (CLC)
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
Accelerating the training of Deep Neural Network (DNN) models is essential for applying deep learning techniques successfully in fields such as computer vision and speech recognition. Distributed frameworks speed up training for large DNN models and datasets. A large body of work has improved model accuracy and training efficiency based on mathematical analysis of the computations in Convolutional Neural Networks (CNNs). However, to run distributed deep learning applications in the real world, users and developers also need to consider the impact of how system resources are distributed. In this work, we deploy a real distributed deep learning cluster with multiple virtual machines. We conduct an in-depth analysis of how system configurations, distribution topologies, and application parameters affect the latency and correctness of distributed deep learning applications. We analyze performance variation under different model consistency schemes and degrees of data parallelism by profiling run-time system utilization and tracking application activities. Based on our observations and analysis, we develop design guidelines for accelerating distributed deep learning training in virtualized environments.
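As a concrete illustration of the synchronous data-parallel setting the abstract refers to (each worker holds a full model replica, trains on its own data shard, and a gradient all-reduce keeps the replicas consistent after every step), the following minimal sketch uses PyTorch DistributedDataParallel. It is not taken from the paper; the toy model, random dataset, and hyperparameters are placeholder assumptions chosen only for illustration.

# Minimal sketch of synchronous data-parallel training (assumed setup, not the
# authors' code). Launch with: torchrun --nproc_per_node=N train.py
# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the environment.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    rank = dist.get_rank()

    # Toy data and model stand in for the CNN workloads studied in the paper.
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(data)        # data parallelism: a disjoint shard per worker
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    model = DDP(model)                        # gradient all-reduce keeps replicas consistent
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)              # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()   # backward() triggers the all-reduce
            opt.step()
        if rank == 0:
            print(f"epoch {epoch} done")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

In this synchronous scheme, all replicas see identical parameters after each step; asynchronous parameter-server variants relax that consistency, which is one of the trade-offs the paper's analysis examines.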
Pages: 8
Related Papers
50 entries in total
  • [31] Applications of deep learning for the analysis of medical data
    Jang, Hyun-Jong
    Cho, Kyung-Ok
    ARCHIVES OF PHARMACAL RESEARCH, 2019, 42 (06) : 492 - 504
  • [32] Deep learning applications for kidney histology analysis
    Pilva, Pourya
    Buelow, Roman
    Boor, Peter
    CURRENT OPINION IN NEPHROLOGY AND HYPERTENSION, 2024, 33 (03): : 291 - 297
  • [34] Consistency in models for communication constrained distributed learning
    Predd, JB
    Kulkarni, SR
    Poor, HV
    LEARNING THEORY, PROCEEDINGS, 2004, 3120 : 442 - 456
  • [35] Gradient Learning With the Mode-Induced Loss: Consistency Analysis and Applications
    Chen, Hong
    Fu, Youcheng
    Jiang, Xue
    Chen, Yanhong
    Li, Weifu
    Zhou, Yicong
    Zheng, Feng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (07) : 9686 - 9699
  • [36] No compromises: distributed transactions with consistency, availability, and performance
    Dragojevic, Aleksandar
    Narayanan, Dushyanth
    Nightingale, Edmund B.
    Renzelmann, Matthew
    Shamis, Alex
    Badam, Anirudh
    Castro, Miguel
    SOSP'15: PROCEEDINGS OF THE TWENTY-FIFTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES, 2015, : 54 - 70
  • [37] Empirical Performance Analysis of Collective Communication for Distributed Deep Learning in a Many-Core CPU Environment
    Woo, Junghoon
    Choi, Hyeonseong
    Lee, Jaehwan
    APPLIED SCIENCES-BASEL, 2020, 10 (19):
  • [38] Performance evaluation of the distributed object consistency protocol
    Perret, Stephane
    Dilley, John
    Arlitt, Martin
    HP Laboratories Technical Report, 1999, (108):
  • [39] Aksum: A performance analysis tool for parallel and distributed applications
    Fahringer, T
    Seragiotto, C
    PERFORMANCE ANALYSIS AND GRID COMPUTING, 2004, : 189 - 208
  • [40] Different approaches to automatic performance analysis of distributed applications
    Margalef, T
    Jorba, J
    Morajko, O
    Morajko, A
    Luque, E
    PERFORMANCE ANALYSIS AND GRID COMPUTING, 2004, : 3 - 19