Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253
Authors:
Peng, Yanghua [1]
Bao, Yixin [1]
Chen, Yangrui [1]
Wu, Chuan [1]
Guo, Chuanxiong [2]
Affiliations:
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Bytedance Inc, Beijing, Peoples R China
Keywords:
Resource management; deep learning
DOI:
10.1145/3190508.3190517
CLC Classification Number:
TP [Automation Technology, Computer Technology]
Subject Classification Code:
0812
Abstract:
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which prevents high resource efficiency and good job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method dynamically allocates resources and places deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
Pages: 14
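
The following is a minimal, illustrative Python sketch of the online-fitting idea described in the abstract. It is not the authors' implementation: the functional form loss(k) ≈ 1/(a·k + b) + c, the helper names loss_model and predict_steps_to_target, and the synthetic data are assumptions introduced here only to show how observed loss values can be fitted online and used to predict how many more steps a job needs to converge.

# Illustrative sketch only (not the authors' code): online fitting of a training
# loss curve in the spirit of Optimus. The functional form, helper names, and
# synthetic data below are assumptions made for this example.
import numpy as np
from scipy.optimize import curve_fit

def loss_model(k, a, b, c):
    # Hypothesized convergence curve: loss decays roughly as 1/(a*k + b) + c.
    return 1.0 / (a * k + b) + c

def predict_steps_to_target(steps, losses, target_loss):
    # Fit the curve to the (step, loss) pairs observed so far, then solve for
    # the step count at which the fitted loss first reaches target_loss.
    (a, b, c), _ = curve_fit(loss_model, steps, losses, p0=[0.1, 1.0, 0.0], maxfev=10000)
    if target_loss <= c:
        return None  # the fitted asymptote never reaches the target
    k_target = (1.0 / (target_loss - c) - b) / a
    return int(np.ceil(k_target))

if __name__ == "__main__":
    # Synthetic loss observations standing in for values reported during training.
    observed_steps = np.arange(1, 21, dtype=float)
    observed_losses = 1.0 / (0.3 * observed_steps + 2.0) + 0.05
    print(predict_steps_to_target(observed_steps, observed_losses, target_loss=0.08))

Per the abstract, Optimus combines such a convergence estimate with a performance model of training speed as a function of allocated resources to decide how to allocate resources and place tasks so that job completion time is minimized.
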
Related Papers:
50 items in total (items [41]-[50] shown below)
  • [41] Dynamic Resource Allocation With Deep Reinforcement Learning in Multibeam Satellite Communication
    Deng, Danhao
    Wang, Chaowei
    Pang, Mingliang
    Wang, Weidong
    IEEE WIRELESS COMMUNICATIONS LETTERS, 2023, 12 (01) : 75 - 79
  • [42] Dynamic Resource Shaping for Compute Clusters
    Pace, Francesco
    Milios, Dimitrios
    Carra, Damiano
    Michiardi, Pietro
    2019 IEEE INTERNATIONAL CONGRESS ON BIG DATA (IEEE BIGDATA CONGRESS 2019), 2019, : 45 - 54
  • [43] DynaFuse: Dynamic Fusion for Resource Efficient Multimodal Machine Learning Inference
    Alikhani, Hamidreza
    Kanduri, Anil
    Liljeberg, Pasi
    Rahmani, Amir M.
    Dutt, Nikil
    IEEE EMBEDDED SYSTEMS LETTERS, 2023, 15 (04) : 222 - 225
  • [44] Towards green networking: Efficient dynamic radio resource management in Open-RAN slicing using deep reinforcement learning and transfer learning
    Sherif, Heba
    Ahmed, Eman
    Kotb, Amira M.
    COMPUTER COMMUNICATIONS, 2025, 236
  • [45] TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters
    Lee, Seil
    Kim, Hanjoo
    Park, Jaehong
    Jang, Jaehee
    Jeong, Chang-Sung
    Yoon, Sungroh
    IEEE ACCESS, 2018, 6 : 27671 - 27680
  • [46] A Robust and Efficient Deep Learning Method for Dynamical Mass Measurements of Galaxy Clusters
    Ho, Matthew
    Rau, Markus Michael
    Ntampaka, Michelle
    Farahi, Arya
    Trac, Hy
    Poczos, Barnabas
    ASTROPHYSICAL JOURNAL, 2019, 887 (01):
  • [47] RLSK: A Job Scheduler for Federated Kubernetes Clusters based on Reinforcement Learning
    Huang, Jiaming
    Xiao, Chuming
    Wu, Weigang
    2020 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING (IC2E 2020), 2020, : 116 - 123
  • [48] To cloud or not to cloud: an on-line scheduler for dynamic privacy-protection of deep learning workload on edge devices
    Yibin Tang
    Ying Wang
    Huawei Li
    Xiaowei Li
    CCF Transactions on High Performance Computing, 2021, 3 : 85 - 100
  • [49] Adversarial Attacks in a Deep Reinforcement Learning based Cluster Scheduler
    Zhang, Shaojun
    Wang, Chen
    Zomaya, Albert Y.
    2020 IEEE 28TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS 2020), 2020, : 1 - 8
  • [50] A Deep-Reinforcement-Learning-Based Scheduler for FPGA HLS
    Chen, Hongzheng
    Shen, Minghua
    2019 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD), 2019,