An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Cited by: 1
Authors
Lee, Sangkwon [1]
Shah, Syed Asif Raza [2, 3]
Seok, Woojin [4]
Moon, Jeonghoon [3]
Kim, Kihyeon [3]
Shah, Syed Hasnain Raza [1]
Affiliations
[1] Univ Sci & Technol, Sci & Technol Informat Sci, Daejeon 34113, South Korea
[2] Sukkur IBA Univ, Dept Comp Sci, CRAIB, Sukkur 65200, Pakistan
[3] Korea Inst Sci & Technol Informat, KREONET, Daejeon 34141, South Korea
[4] Korea Inst Sci & Technol Informat, Ctr Quantum Commun, Daejeon 34141, South Korea
Keywords
cloud computing; scheduling; container technology; distributed computing; network monitoring; deep learning; distributed HPC; AI
DOI
10.3390/electronics12143021
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Deep learning is an increasingly used technique for solving complex artificial intelligence (AI) problems. As datasets expand and models grow more complex, large-scale deep learning has become a significant challenge. For training large-scale models, the cloud can serve as a distributed HPC (high-performance computing) platform, with benefits in cost and flexibility. However, the network is one of the major performance barriers for distributed deep learning in a distributed HPC environment: performance is often limited by heavy traffic, such as the many gradient transfers that distributed stochastic gradient descent requires. Many network studies in distributed deep learning address these problems, but most focus only on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization to minimize communication delay, rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, a zone or region is a set of an appropriate number of distributed HPC nodes; if the cluster nodes for a distributed deep learning task are assigned to different zones or regions, performance may degrade because of network delay. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, scoring based on metrics from network monitoring tools, such as loss, delay, and throughput, is applied to select the optimal node within the zone. Our proposal has been validated on the Kubernetes platform, an open-source orchestrator for the automatic management and deployment of microservices. The proposed scheduler improves the performance of distributed deep learning.
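The two ideas in the abstract (keep workers in one zone where possible, then rank nodes by loss, delay, and throughput) can be illustrated with a minimal sketch. This is not the authors' scheduler: the `Node` type, the weights, the normalisation bounds, and the fallback rule for jobs that do not fit in one zone are all hypothetical choices made for illustration, and the abstract does not specify any of them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    zone: str
    loss: float             # packet loss as a fraction, 0.0 to 1.0
    delay_ms: float         # measured delay in milliseconds
    throughput_gbps: float  # measured throughput in Gbit/s


# Hypothetical weights and normalisation bounds (not taken from the paper).
W_LOSS, W_DELAY, W_TPUT = 0.4, 0.3, 0.3
MAX_DELAY_MS = 100.0
MAX_TPUT_GBPS = 10.0


def score(node: Node) -> float:
    """Return a value in [0, 1]; higher means a better-connected node."""
    loss_term = 1.0 - min(node.loss, 1.0)
    delay_term = 1.0 - min(node.delay_ms, MAX_DELAY_MS) / MAX_DELAY_MS
    tput_term = min(node.throughput_gbps, MAX_TPUT_GBPS) / MAX_TPUT_GBPS
    return W_LOSS * loss_term + W_DELAY * delay_term + W_TPUT * tput_term


def place(nodes: list[Node], workers: int) -> list[Node]:
    """Choose `workers` nodes, preferring a single zone.

    Returns fewer nodes than requested if the cluster is too small.
    """
    by_zone: dict[str, list[Node]] = defaultdict(list)
    for node in nodes:
        by_zone[node.zone].append(node)
    for zone_nodes in by_zone.values():
        zone_nodes.sort(key=score, reverse=True)

    # Zones large enough to host the whole job.
    fitting = [z for z in by_zone.values() if len(z) >= workers]
    if fitting:
        best = max(fitting, key=lambda z: sum(score(n) for n in z[:workers]))
        return best[:workers]

    # Assumed fallback: fill from the largest zones first so the job
    # spans as few zones as possible.
    chosen: list[Node] = []
    for zone_nodes in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(zone_nodes[: workers - len(chosen)])
        if len(chosen) == workers:
            break
    return chosen
```

In a Kubernetes setting, logic of this kind would typically live in a scheduler plugin or extender, with the zone read from node labels and the metrics supplied by a monitoring system; those integration details are outside what the abstract describes.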
Pages: 18
Related papers
50 in total
  • [21] Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route
    Li, Zonghang
    Feng, Wenjiao
    Cai, Weibo
    Yu, Hongfang
    Luo, Long
    Sun, Gang
    Du, Hongyang
    Niyato, Dusit
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (05) : 4238 - 4253
  • [22] A Network-Aware Distributed Energy Resource Aggregation Framework for Flexible, Cost-Optimal, and Resilient Operation
    Utkarsh, Kumar
    Ding, Fei
    Jin, Xin
    Blonsky, Michael
    Padullaparti, Harsha
    Balamurugan, Sivasathya Pradha
    IEEE TRANSACTIONS ON SMART GRID, 2022, 13 (02) : 1213 - 1224
  • [23] Network-Aware Server Placement for Highly Interactive Distributed Virtual Environments
    Ta, Duong
    Zhou, Suiping
    Cai, Wentong
    Tang, Xueyan
    Ayani, Rassul
    DS-RT 2008: 12TH 2008 IEEE/ACM INTERNATIONAL SYMPOSIUM ON DISTRIBUTED SIMULATION AND REAL TIME APPLICATIONS, PROCEEDINGS, 2008, : 95 - +
  • [24] A Recursive Distributed Topology Discovery Service for Network-Aware Grid Clients
    Paolucci, Francesco
    Valcarenghi, Luca
    Castoldi, Piero
    Cugini, Filippo
    2009 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, VOLS 1-8, 2009, : 1313 - +
  • [25] Network-Aware HEFT Scheduling for Grid
    Yousaf, Muhammad Murtaza
    Welzl, Michael
    SCIENTIFIC WORLD JOURNAL, 2014,
  • [26] Network-aware Instance Scheduling in OpenStack
    Scharf, Michael
    Stein, Manuel
    Voith, Thomas
    Hilt, Volker
    24TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS ICCCN 2015, 2015,
  • [27] Distributed Optimal Scheduling in UAV Swarm Network
    Sun, Wei
    2021 IEEE 18TH ANNUAL CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE (CCNC), 2021,
  • [28] Clearing and Pricing for Network-Aware Local Flexibility Markets using Distributed Optimization
    Birk, Sascha
    Talari, Saber
    Gebbran, Daniel
    Ketter, Wolfgang
    Schneiders, Thorsten
    2023 IEEE BELGRADE POWERTECH, 2023,
  • [29] Network-Aware Distributed Electricity Markets: A Techno-Economic Comparative Study
    Domenech, Carmen Bas
    Riaz, Shariq
    Mancarella, Pierluigi
    2021 IEEE PES INNOVATIVE SMART GRID TECHNOLOGIES - ASIA (ISGT ASIA), 2021,
  • [30] MemEFS: A network-aware elastic in-memory runtime distributed file system
    Uta, Alexandru
    Danner, Ove
    van der Weegen, Cas
    Oprescu, Ana-Maria
    Sandu, Andreea
    Costache, Stefania
    Kielmann, Thilo
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2018, 82 : 631 - 646