Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253
Authors
Peng, Yanghua [1]
Bao, Yixin [1]
Chen, Yangrui [1]
Wu, Chuan [1]
Guo, Chuanxiong [2]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Bytedance Inc, Beijing, Peoples R China
Keywords
Resource management; deep learning
DOI
10.1145/3190508.3190517
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming, so efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically specify a fixed amount of resources for each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models to accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
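The idea of fitting a convergence curve online can be illustrated with a minimal sketch. The abstract does not state which curve Optimus fits, so the inverse-linear form below, loss = 1 / (b0 * step + b1) + floor, is an assumption chosen because it becomes linear after a simple transform; the function names `fit_inverse_loss` and `steps_to_reach` and the fixed `floor` parameter are hypothetical and do not come from the paper.

```python
def fit_inverse_loss(steps, losses, floor=0.0):
    """Fit 1 / (loss - floor) = b0 * step + b1 by ordinary least squares.

    Assumption (not from the abstract): the training loss follows
    loss = 1 / (b0 * step + b1) + floor, with `floor` known in advance.
    """
    if len(steps) != len(losses) or len(steps) < 2:
        raise ValueError("need at least two (step, loss) observations")
    # Transform the loss so that the assumed model is linear in `step`.
    ys = [1.0 / (loss - floor) for loss in losses]
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, ys))
    b0 = sxy / sxx
    b1 = mean_y - b0 * mean_x
    return b0, b1


def steps_to_reach(target_loss, b0, b1, floor=0.0):
    """Invert the fitted curve: the step at which the loss hits target_loss."""
    return max(0.0, (1.0 / (target_loss - floor) - b1) / b0)
```

A scheduler could call `fit_inverse_loss` again each time new loss values arrive and use `steps_to_reach` to estimate how much training remains for a job. A full nonlinear fit that also estimates `floor` would be more realistic; the linearized version is used here only to keep the sketch dependency-free.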
Pages: 14