An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Cited by: 1
Authors
Lee, Sangkwon [1 ]
Shah, Syed Asif Raza [2 ,3 ]
Seok, Woojin [4 ]
Moon, Jeonghoon [3 ]
Kim, Kihyeon [3 ]
Shah, Syed Hasnain Raza [1 ]
Affiliations
[1] Univ Sci & Technol, Sci & Technol Informat Sci, Daejeon 34113, South Korea
[2] Sukkur IBA Univ, Dept Comp Sci, CRAIB, Sukkur 65200, Pakistan
[3] Korea Inst Sci & Technol Informat, KREONET, Daejeon 34141, South Korea
[4] Korea Inst Sci & Technol Informat, Ctr Quantum Commun, Daejeon 34141, South Korea
Keywords
cloud computing; scheduling; container technology; distributed computing; network monitoring; deep learning; distributed HPC; AI;
DOI
10.3390/electronics12143021
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep learning is a rapidly growing technique for solving complex artificial intelligence (AI) problems. As datasets expand and models become more complex, large-scale deep learning has become a significant challenge. For training large-scale models, the cloud can serve as a distributed HPC (high-performance computing) platform, offering benefits in cost and flexibility. However, one of the major performance barriers for distributed deep learning in a distributed HPC environment is the network: performance is often limited by heavy communication traffic, such as the many stochastic gradient descent transfers required for distributed communication. Many network studies in distributed deep learning address these problems, but most focus only on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization with computation to minimize communication delay, rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, if cluster nodes (a set of an appropriate number of distributed HPC nodes) are assigned to different zones or regions when performing distributed deep learning tasks, performance degradation due to network delay may occur. The proposed network optimization algorithm places distributed work in the same zone as much as possible to reduce network delay. Furthermore, scoring based on network monitoring metrics such as loss, delay, and throughput is applied to select the optimal nodes within the zone. Our proposal has been validated on Kubernetes, an open-source orchestrator for the automatic management and deployment of microservices. The results show that the proposed scheduler improves the performance of distributed deep learning.
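The zone-first placement with network-metric scoring described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the node records, the metric weights, and the linear scoring formula are all assumptions made for the example.

```python
def network_score(node, w_loss=0.4, w_delay=0.4, w_tput=0.2):
    """Assumed scoring formula: higher is better. Penalize packet loss (%)
    and delay (ms); reward throughput (Gbps). Weights are illustrative."""
    return (w_tput * node["throughput"]
            - w_loss * node["loss"]
            - w_delay * node["delay"])

def schedule(nodes, replicas):
    """Prefer packing all replicas into a single zone to avoid inter-zone
    delay; within the chosen zone, pick the best-scoring nodes."""
    by_zone = {}
    for n in nodes:
        by_zone.setdefault(n["zone"], []).append(n)
    # Zones large enough to host every replica are preferred over a split placement.
    candidates = [z for z, ns in by_zone.items() if len(ns) >= replicas]
    if candidates:
        best_zone = max(candidates,
                        key=lambda z: sum(network_score(n) for n in by_zone[z]))
        pool = by_zone[best_zone]
    else:
        pool = nodes  # no single zone fits: fall back to spanning zones
    return sorted(pool, key=network_score, reverse=True)[:replicas]

# Hypothetical monitoring snapshot for three nodes in two zones.
nodes = [
    {"name": "a1", "zone": "A", "loss": 0.1, "delay": 2.0, "throughput": 9.0},
    {"name": "a2", "zone": "A", "loss": 0.2, "delay": 3.0, "throughput": 8.5},
    {"name": "b1", "zone": "B", "loss": 0.0, "delay": 1.0, "throughput": 9.5},
]
placement = schedule(nodes, replicas=2)
print([n["name"] for n in placement])  # both replicas land in zone A
```

Although node b1 has the best individual network score, the sketch still places both replicas in zone A, reflecting the paper's premise that co-locating distributed workers in one zone outweighs a single node's better metrics. In the actual system this logic would run as a Kubernetes scheduler rather than a standalone function.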
Pages: 18
Related Papers
50 records
  • [41] MemFlow: Memory-Aware Distributed Deep Learning
    Band, Neil
    SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2020, : 2883 - 2885
  • [42] Probabilistic Network-Aware Task Placement for MapReduce Scheduling
    Shen, Haiying
    Sarker, Ankur
    Yu, Lei
    Deng, Feng
    2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2016, : 241 - 250
  • [43] Network performance in distributed HPC clusters
    Huang, B
    Bauer, M
    Katchabaw, M
    PDPTA '05: Proceedings of the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications, Vols 1-3, 2005, : 546 - 549
  • [44] Cachalot: A Network-Aware, Cooperative Cache Network for Geo-Distributed, Data-Intensive Applications
    Jiang, Fan
    Castillo, Claris
    Ahalt, Stan
    NOMS 2018 - 2018 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, 2018,
  • [45] A network-aware VM re-scheduling algorithm
    Luo, Gang-Yi
    Qian, Zhu-Zhong
    Lu, Sang-Lu
    Jisuanji Xuebao/Chinese Journal of Computers, 2015, 38 (05): : 932 - 943
  • [46] Coflow Deadline Scheduling via Network-Aware Optimization
    Tseng, Shih-Hao
    Tang, Ao
    2018 56TH ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2018, : 829 - 833
  • [47] Extending the Kubernetes Platform with Network-Aware Scheduling Capabilities
    Marchese, Angelo
    Tomarchio, Orazio
    SERVICE-ORIENTED COMPUTING (ICSOC 2022), 2022, 13740 : 465 - 480
  • [48] Communication Scheduling Optimization for Distributed Deep Learning Systems
    Tsai, Ching-Yuan
    Lin, Ching-Chi
    Liu, Pangfeng
    Wu, Jan-Jan
    2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018), 2018, : 739 - 746
  • [49] Survey on Network of Distributed Deep Learning Training
    Zhu H.
    Yuan G.
    Yao C.
    Tan G.
    Wang Z.
    Hu Z.
    Zhang X.
    An X.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (01): : 98 - 115
  • [50] Distributed Optimal Power Scheduling for Microgrid System via Deep Reinforcement Learning with Privacy Preserving
    He, Tong
    Wu, Xiang
    Dong, Hui
    Guo, Fanghong
    Yu, Wei
    2022 IEEE 17TH INTERNATIONAL CONFERENCE ON CONTROL & AUTOMATION, ICCA, 2022, : 820 - 825