Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253
Authors
Peng, Yanghua [1]
Bao, Yixin [1]
Chen, Yangrui [1]
Wu, Chuan [1]
Guo, Chuanxiong [2]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Bytedance Inc, Beijing, Peoples R China
Keywords
Resource management; deep learning
DOI
10.1145/3190508.3190517
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is the key to the maximal performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically specify a fixed amount of resources for each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models to accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
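The allocation idea in the abstract (estimate training speed as a function of allocated resources, then hand out resources to reduce completion time) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function names, the greedy marginal-gain rule, and the toy speed curve are all assumptions made for illustration, and a real scheduler would fit the speed model from measurements.

```python
from typing import Callable, Dict


def toy_speed(job: str, workers: int) -> float:
    """Hypothetical speed model (steps per unit time) with diminishing returns.

    Assumption for illustration only; a real model would be fitted per job.
    """
    return workers / (1.0 + 0.1 * (workers - 1))


def greedy_allocate(
    remaining_steps: Dict[str, float],
    speed: Callable[[str, int], float],
    capacity: int,
) -> Dict[str, int]:
    """Give each spare worker to the job whose estimated remaining time
    shrinks the most from one extra worker (greedy marginal gain)."""
    if capacity < len(remaining_steps):
        raise ValueError("capacity must cover one worker per job")

    # Every job starts with one worker so that it can make progress.
    alloc = {job: 1 for job in remaining_steps}

    def est_time(job: str, workers: int) -> float:
        return remaining_steps[job] / speed(job, workers)

    def gain(job: str) -> float:
        n = alloc[job]
        return est_time(job, n) - est_time(job, n + 1)

    for _ in range(capacity - len(alloc)):
        best = max(alloc, key=gain)
        if gain(best) <= 0:
            break  # no job benefits from another worker
        alloc[best] += 1
    return alloc
```

Calling `greedy_allocate({"a": 1000, "b": 200}, toy_speed, capacity=6)` assigns more workers to the job with more remaining work, because its estimated time drops faster per added worker under this toy model.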
Pages: 14