Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253
Authors:
Peng, Yanghua [1]
Bao, Yixin [1]
Chen, Yangrui [1]
Wu, Chuan [1]
Guo, Chuanxiong [2]
Affiliations:
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Bytedance Inc, Beijing, Peoples R China
Keywords:
Resource management; deep learning
DOI:
10.1145/3190508.3190517
CLC Classification Number:
TP [Automation Technology, Computer Technology]
Subject Classification Code:
0812
Abstract:
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which prevents high resource efficiency and good job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method dynamically allocates resources and places deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
Pages: 14
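
The following is a minimal, illustrative Python sketch of the online-fitting idea described in the abstract. It is not the authors' implementation: the functional form loss(k) ≈ 1/(a·k + b) + c, the helper names loss_model and predict_steps_to_target, and the synthetic data are assumptions introduced here only to show how observed loss values can be fitted online and used to predict how many more steps a job needs to converge.

# Illustrative sketch only (not the authors' code): online fitting of a training
# loss curve in the spirit of Optimus. The functional form, helper names, and
# synthetic data below are assumptions made for this example.
import numpy as np
from scipy.optimize import curve_fit

def loss_model(k, a, b, c):
    # Hypothesized convergence curve: loss decays roughly as 1/(a*k + b) + c.
    return 1.0 / (a * k + b) + c

def predict_steps_to_target(steps, losses, target_loss):
    # Fit the curve to the (step, loss) pairs observed so far, then solve for
    # the step count at which the fitted loss first reaches target_loss.
    (a, b, c), _ = curve_fit(loss_model, steps, losses, p0=[0.1, 1.0, 0.0], maxfev=10000)
    if target_loss <= c:
        return None  # the fitted asymptote never reaches the target
    k_target = (1.0 / (target_loss - c) - b) / a
    return int(np.ceil(k_target))

if __name__ == "__main__":
    # Synthetic loss observations standing in for values reported during training.
    observed_steps = np.arange(1, 21, dtype=float)
    observed_losses = 1.0 / (0.3 * observed_steps + 2.0) + 0.05
    print(predict_steps_to_target(observed_steps, observed_losses, target_loss=0.08))

Per the abstract, Optimus combines such a convergence estimate with a performance model of training speed as a function of allocated resources to decide how to allocate resources and place tasks so that job completion time is minimized.
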
Related Papers:
50 items in total (items [41]-[50] shown below)
  • [41] Dynamic Resource Allocation With Deep Reinforcement Learning in Multibeam Satellite Communication
    Deng, Danhao
    Wang, Chaowei
    Pang, Mingliang
    Wang, Weidong
    IEEE WIRELESS COMMUNICATIONS LETTERS, 2023, 12 (01) : 75 - 79
  • [42] Dynamic Resource Shaping for Compute Clusters
    Pace, Francesco
    Milios, Dimitrios
    Carra, Damiano
    Michiardi, Pietro
    2019 IEEE INTERNATIONAL CONGRESS ON BIG DATA (IEEE BIGDATA CONGRESS 2019), 2019, : 45 - 54
  • [43] DynaFuse: Dynamic Fusion for Resource Efficient Multimodal Machine Learning Inference
    Alikhani, Hamidreza
    Kanduri, Anil
    Liljeberg, Pasi
    Rahmani, Amir M.
    Dutt, Nikil
    IEEE EMBEDDED SYSTEMS LETTERS, 2023, 15 (04) : 222 - 225
  • [44] Towards green networking: Efficient dynamic radio resource management in Open-RAN slicing using deep reinforcement learning and transfer learning
    Sherif, Heba
    Ahmed, Eman
    Kotb, Amira M.
    COMPUTER COMMUNICATIONS, 2025, 236
  • [45] TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters
    Lee, Seil
    Kim, Hanjoo
    Park, Jaehong
    Jang, Jaehee
    Jeong, Chang-Sung
    Yoon, Sungroh
    IEEE ACCESS, 2018, 6 : 27671 - 27680
  • [46] A Robust and Efficient Deep Learning Method for Dynamical Mass Measurements of Galaxy Clusters
    Ho, Matthew
    Rau, Markus Michael
    Ntampaka, Michelle
    Farahi, Arya
    Trac, Hy
    Poczos, Barnabas
    ASTROPHYSICAL JOURNAL, 2019, 887 (01):
  • [47] RLSK: A Job Scheduler for Federated Kubernetes Clusters based on Reinforcement Learning
    Huang, Jiaming
    Xiao, Chuming
    Wu, Weigang
    2020 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING (IC2E 2020), 2020, : 116 - 123
  • [48] To cloud or not to cloud: an on-line scheduler for dynamic privacy-protection of deep learning workload on edge devices
    Yibin Tang
    Ying Wang
    Huawei Li
    Xiaowei Li
    CCF Transactions on High Performance Computing, 2021, 3 : 85 - 100
  • [49] Adversarial Attacks in a Deep Reinforcement Learning based Cluster Scheduler
    Zhang, Shaojun
    Wang, Chen
    Zomaya, Albert Y.
    2020 IEEE 28TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS 2020), 2020, : 1 - 8
  • [50] A Deep-Reinforcement-Learning-Based Scheduler for FPGA HLS
    Chen, Hongzheng
    Shen, Minghua
    2019 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD), 2019,