An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Cited by: 1
Authors
Lee, Sangkwon [1]
Shah, Syed Asif Raza [2, 3]
Seok, Woojin [4]
Moon, Jeonghoon [3]
Kim, Kihyeon [3]
Shah, Syed Hasnain Raza [1]
Affiliations
[1] Univ Sci & Technol, Sci & Technol Informat Sci, Daejeon 34113, South Korea
[2] Sukkur IBA Univ, Dept Comp Sci, CRAIB, Sukkur 65200, Pakistan
[3] Korea Inst Sci & Technol Informat, KREONET, Daejeon 34141, South Korea
[4] Korea Inst Sci & Technol Informat, Ctr Quantum Commun, Daejeon 34141, South Korea
Keywords
cloud computing; scheduling; container technology; distributed computing; network monitoring; deep learning; distributed HPC; AI;
DOI
10.3390/electronics12143021
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Deep learning is an increasingly used technique for solving complex artificial intelligence (AI) problems. Large-scale deep learning has become a significant challenge as datasets expand and deep learning models grow more complex. For training large-scale models, the cloud can serve as a distributed HPC (high-performance computing) tool with benefits in cost and flexibility. However, one of the major performance barriers for distributed deep learning in a distributed HPC environment is the network. Performance is often limited by heavy traffic, such as the many transfers that stochastic gradient descent requires for distributed communication. Many network studies in distributed deep learning address these problems, but most focus only on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization to minimize communication delay, rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, a zone or region is a set of an appropriate number of distributed HPC nodes; if the cluster nodes for a distributed deep learning task are assigned to different zones or regions, performance may degrade because of network delay. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, scoring based on network monitoring metrics such as loss, delay, and throughput is applied to select the optimal node within the zone. Our proposal has been validated on the Kubernetes platform, an open-source orchestrator for the automatic management and deployment of microservices. The proposed scheduler improves the performance of distributed deep learning.
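The two ideas in the abstract (keep a job's workers in one zone where possible, then rank nodes by monitored loss, delay, and throughput) can be illustrated with a minimal Python sketch. It is not the authors' scheduler or any Kubernetes API: the `Node` fields, the `score` formula, the weights, and the fallback of filling from the largest zones first are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical per-node measurements from a network monitor."""
    name: str
    zone: str
    loss_pct: float         # packet loss, percent
    delay_ms: float         # round-trip delay, milliseconds
    throughput_gbps: float  # measured throughput, Gbit/s


def score(node, max_delay_ms, max_tput_gbps, weights=(0.3, 0.3, 0.4)):
    """Higher is better. The weights are illustrative, not from the paper."""
    w_loss, w_delay, w_tput = weights
    loss_term = 1.0 - min(node.loss_pct, 100.0) / 100.0
    delay_term = 1.0 - node.delay_ms / max_delay_ms if max_delay_ms > 0 else 1.0
    tput_term = node.throughput_gbps / max_tput_gbps if max_tput_gbps > 0 else 0.0
    return w_loss * loss_term + w_delay * delay_term + w_tput * tput_term


def place_workers(nodes, num_workers):
    """Pick nodes for a job, preferring a single zone, then the best scores."""
    if num_workers <= 0:
        return []
    if num_workers > len(nodes):
        raise ValueError("not enough nodes for the requested workers")

    max_delay = max(n.delay_ms for n in nodes)
    max_tput = max(n.throughput_gbps for n in nodes)

    def node_score(n):
        return score(n, max_delay, max_tput)

    # Group nodes by zone and rank each zone's nodes from best to worst.
    by_zone = defaultdict(list)
    for n in nodes:
        by_zone[n.zone].append(n)
    for zone_nodes in by_zone.values():
        zone_nodes.sort(key=node_score, reverse=True)

    # Case 1: some zone can host the whole job; take the best such zone.
    fitting = [z for z in by_zone.values() if len(z) >= num_workers]
    if fitting:
        best = max(fitting, key=lambda z: sum(node_score(n) for n in z[:num_workers]))
        return best[:num_workers]

    # Case 2 (assumed fallback): fill from the largest zones first so the
    # job spans as few zones as possible.
    chosen = []
    for zone_nodes in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(zone_nodes[: num_workers - len(chosen)])
        if len(chosen) == num_workers:
            break
    return chosen


cluster = [
    Node("a1", "A", 0.0, 1.0, 10.0),
    Node("a2", "A", 0.0, 2.0, 9.0),
    Node("b1", "B", 5.0, 10.0, 5.0),
    Node("b2", "B", 1.0, 8.0, 6.0),
]
two = place_workers(cluster, 2)    # fits inside one zone
three = place_workers(cluster, 3)  # must spill into a second zone
```

In this sketch a two-worker job stays entirely inside one zone, while a three-worker job has to span two zones and takes the highest-scoring node from the second one. In a real Kubernetes deployment the zone would typically come from node labels and the metrics from a monitoring system; how the paper wires these together is not stated in the abstract.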
Pages: 18