Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253
Authors
Peng, Yanghua [1 ]
Bao, Yixin [1 ]
Chen, Yangrui [1 ]
Wu, Chuan [1 ]
Guo, Chuanxiong [2 ]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Bytedance Inc, Beijing, Peoples R China
Keywords
Resource management; deep learning
DOI
10.1145/3190508.3190517
CLC number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs: they typically assign a fixed amount of resources to each job, which prevents high resource efficiency and good job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed for dynamically allocating resources and placing deep learning tasks so as to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
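The online-fitting idea in the abstract can be sketched in a few lines: fit a parametric loss-versus-step curve to the loss values observed so far, then extrapolate how many more steps a job needs to reach a target loss (the scheduler can combine this with a training-speed model to estimate remaining time under a given resource allocation). The sketch below is illustrative only; the parametric form l(k) ≈ 1/(b0·k + b1) + b2, the coarse grid-search fitter, and the function name are assumptions for this example, not the paper's implementation.

```python
def predict_steps_to_converge(history, target_loss):
    """Fit l(k) = 1/(b0*k + b1) + b2 to observed (step, loss) pairs by a
    coarse grid search, then extrapolate the step at which the fitted
    curve first reaches target_loss. Returns None if unreachable."""
    best, best_err = None, float("inf")
    # Coarse grid over plausible parameter ranges (illustrative only).
    for b0 in [x / 100 for x in range(1, 50)]:
        for b1 in [x / 10 for x in range(1, 20)]:
            for b2 in [x / 100 for x in range(0, 20)]:
                err = sum((1 / (b0 * k + b1) + b2 - l) ** 2
                          for k, l in history)
                if err < best_err:
                    best, best_err = (b0, b1, b2), err
    b0, b1, b2 = best
    if target_loss <= b2:  # asymptote never reaches the target
        return None
    # Solve 1/(b0*k + b1) + b2 = target_loss for k.
    return max(0.0, (1 / (target_loss - b2) - b1) / b0)

# Synthetic loss curve generated from known parameters:
true_loss = lambda k: 1 / (0.05 * k + 0.5) + 0.1
history = [(k, true_loss(k)) for k in range(1, 30)]
steps = predict_steps_to_converge(history, target_loss=0.3)
```

On the synthetic curve above, the analytic answer is k = 90 steps, and the grid search recovers it because the true parameters lie on the grid; a production scheduler would instead refit continuously as new loss samples arrive (the paper mentions non-negative least squares for this purpose).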
Pages: 14
Related Papers
50 records in total
  • [31] Optimus: Towards Optimal Layer-Fusion on Deep Learning Processors
    Cai, Xuyi
    Wang, Ying
    Zhang, Lei
    LCTES '21: PROCEEDINGS OF THE 22ND ACM SIGPLAN/SIGBED INTERNATIONAL CONFERENCE ON LANGUAGES, COMPILERS, AND TOOLS FOR EMBEDDED SYSTEMS, 2021, : 67 - 79
  • [32] Efficient Low-Complexity Scheduler for Wireless Resource Virtualization
    Kalil, M.
    Moubayed, A.
    Shami, A.
    Al-Dweik, A.
    IEEE WIRELESS COMMUNICATIONS LETTERS, 2016, 5 (01) : 56 - 59
  • [33] Autonomous Learning for Efficient Resource Utilization of Dynamic VM Migration
    Choi, Hyung Won
    Kwak, Hukeun
    Sohn, Andrew
    Chung, Kyusik
    ICS'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2008, : 185 - +
  • [34] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
    Zhang, Hao
    Zheng, Zeyu
    Xu, Shizhen
    Dai, Wei
    Ho, Qirong
    Liang, Xiaodan
    Hu, Zhiting
    Wei, Jinliang
    Xie, Pengtao
    Xing, Eric P.
    2017 USENIX ANNUAL TECHNICAL CONFERENCE (USENIX ATC '17), 2017, : 181 - 193
  • [35] DeepGANTT: A Scalable Deep Learning Scheduler for Backscatter Networks
    Perez-Ramirez, Daniel F.
    Perez-Penichet, Carlos
    Tsiftes, Nicolas
    Voigt, Thiemo
    Kostic, Dejan
    Boman, Magnus
    PROCEEDINGS OF THE 2023 THE 22ND INTERNATIONAL CONFERENCE ON INFORMATION PROCESSING IN SENSOR NETWORKS, IPSN 2023, 2023, : 163 - 176
  • [36] Efficient Deep Structure Learning for Resource-Limited IoT Devices
    Shen, Shibo
    Li, Rongpeng
    Zhao, Zhifeng
    Liu, Qing
    Liang, Jing
    Zhang, Honggang
    2020 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2020,
  • [37] Resource-Efficient Deep Learning: Fast Hand Gestures on Microcontrollers
    Mach, Tuan Kiet Tran
    Van, Khai Nguyen
    Le, Minhhuy
    EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 2024, 11 (03) : 1 - 11
  • [38] GADaM: Generic Adaptive Deep-learning-based Multipath Scheduler Selector for Dynamic Heterogeneous Environment
    Chu, Tran-Tuan
    Labiod, Mohamed Aymen
    Tran, Hai-Anh
    Mellouk, Abdelhamid
    IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2022), 2022, : 4908 - 4913
  • [39] Traffic Prediction-Enabled Energy-Efficient Dynamic Computing Resource Allocation in CRAN Based on Deep Learning
    Fu, Yongqin
    Wang, Xianbin
    IEEE OPEN JOURNAL OF THE COMMUNICATIONS SOCIETY, 2022, 3 : 159 - 175
  • [40] HUNHODRL: Energy efficient resource distribution in a cloud environment using hybrid optimized deep reinforcement model with HunterPlus scheduler
    Chellamuthu, Senthilkumar
    Ramanathan, Kalaivani
    Arivanandhan, Rajesh
    NETWORK-COMPUTATION IN NEURAL SYSTEMS, 2025,