An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Cited by: 1
Authors
Lee, Sangkwon [1]
Shah, Syed Asif Raza [2, 3]
Seok, Woojin [4]
Moon, Jeonghoon [3]
Kim, Kihyeon [3]
Shah, Syed Hasnain Raza [1]
Affiliations
[1] Univ Sci & Technol, Sci & Technol Informat Sci, Daejeon 34113, South Korea
[2] Sukkur IBA Univ, Dept Comp Sci, CRAIB, Sukkur 65200, Pakistan
[3] Korea Inst Sci & Technol Informat, KREONET, Daejeon 34141, South Korea
[4] Korea Inst Sci & Technol Informat, Ctr Quantum Commun, Daejeon 34141, South Korea
Keywords
cloud computing; scheduling; container technology; distributed computing; network monitoring; deep learning; distributed HPC; AI
DOI
10.3390/electronics12143021
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Deep learning is an increasingly used technique for solving complex artificial intelligence (AI) problems. As datasets grow and models become more complex, large-scale deep learning has become a significant challenge. For training large-scale models, the cloud can serve as a distributed high-performance computing (HPC) platform with benefits in cost and flexibility. However, the network is one of the major performance barriers for distributed deep learning in a distributed HPC environment. Performance is often limited by heavy traffic, such as the many gradient transfers that distributed stochastic gradient descent requires. Many studies address networking in distributed deep learning, but most focus on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization to minimize communication delay, rather than considering the actual network. This paper focuses on the actual network, especially in a distributed HPC environment. In such an environment, performance may degrade because of network delay if the cluster nodes for a distributed deep learning task are assigned to different zones or regions, where a zone or region is a set of an appropriate number of distributed HPC nodes. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, scoring based on metrics from network monitoring tools, such as loss, delay, and throughput, is applied to select the optimal node within the zone. The proposal has been validated on the Kubernetes platform, an open-source orchestrator for the automatic management and deployment of microservices. The proposed scheduler improves the performance of distributed deep learning.
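The two ideas named in the abstract, keeping a job inside one zone where possible and ranking nodes by loss, delay, and throughput, can be illustrated with a small sketch. This is not the authors' scheduler: the `Node` fields, the equal weighting in `score`, the normalisation caps, and the fallback rule in `place` are all assumptions made for illustration, since the abstract does not give the scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical record of monitored values for one cluster node.
    name: str
    zone: str
    loss_pct: float         # packet loss, percent
    delay_ms: float         # round-trip delay, milliseconds
    throughput_gbps: float  # measured throughput, Gbit/s


def score(node, max_delay_ms=100.0, max_throughput_gbps=10.0):
    """Return a score in [0, 1]; higher is better.

    Equal weights and the two caps are assumptions, not values from the paper.
    """
    loss_term = 1.0 - min(node.loss_pct, 100.0) / 100.0
    delay_term = 1.0 - min(node.delay_ms, max_delay_ms) / max_delay_ms
    tput_term = min(node.throughput_gbps, max_throughput_gbps) / max_throughput_gbps
    return (loss_term + delay_term + tput_term) / 3.0


def place(nodes, workers):
    """Pick up to `workers` nodes, preferring a single zone, then the best scores."""
    by_zone = {}
    for node in nodes:
        by_zone.setdefault(node.zone, []).append(node)
    for members in by_zone.values():
        members.sort(key=score, reverse=True)

    # Candidate placements that keep the whole job inside one zone.
    fitting = [m[:workers] for m in by_zone.values() if len(m) >= workers]
    if fitting:
        return max(fitting, key=lambda chosen: sum(score(n) for n in chosen))

    # Fallback (assumed): no zone is large enough, so fill from the largest
    # zones first to keep the number of zones involved small.
    chosen = []
    for members in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(members[: workers - len(chosen)])
        if len(chosen) == workers:
            break
    return chosen
```

Calling `place(nodes, workers)` with a list of `Node` records returns the selected nodes, best score first within each zone. If the cluster has fewer nodes than `workers`, the sketch simply returns every node it has; a real scheduler would instead leave the job pending.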
Pages: 18