Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253
Authors
Peng, Yanghua [1 ]
Bao, Yixin [1 ]
Chen, Yangrui [1 ]
Wu, Chuan [1 ]
Guo, Chuanxiong [2 ]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Bytedance Inc, Beijing, Peoples R China
Keywords
Resource management; deep learning
DOI
10.1145/3190508.3190517
CLC number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs: they typically assign a fixed amount of resources to each job, which prevents high resource efficiency and good job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed for dynamically allocating resources and placing deep learning tasks so as to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
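The online-fitting idea in the abstract can be sketched in a few lines: fit a parametric loss-versus-step curve to the loss values observed so far, then extrapolate how many more steps a job needs to reach a target loss (the scheduler can combine this with a training-speed model to estimate remaining time under a given resource allocation). The sketch below is illustrative only; the parametric form l(k) ≈ 1/(b0·k + b1) + b2, the coarse grid-search fitter, and the function name are assumptions for this example, not the paper's implementation.

```python
def predict_steps_to_converge(history, target_loss):
    """Fit l(k) = 1/(b0*k + b1) + b2 to observed (step, loss) pairs by a
    coarse grid search, then extrapolate the step at which the fitted
    curve first reaches target_loss. Returns None if unreachable."""
    best, best_err = None, float("inf")
    # Coarse grid over plausible parameter ranges (illustrative only).
    for b0 in [x / 100 for x in range(1, 50)]:
        for b1 in [x / 10 for x in range(1, 20)]:
            for b2 in [x / 100 for x in range(0, 20)]:
                err = sum((1 / (b0 * k + b1) + b2 - l) ** 2
                          for k, l in history)
                if err < best_err:
                    best, best_err = (b0, b1, b2), err
    b0, b1, b2 = best
    if target_loss <= b2:  # asymptote never reaches the target
        return None
    # Solve 1/(b0*k + b1) + b2 = target_loss for k.
    return max(0.0, (1 / (target_loss - b2) - b1) / b0)

# Synthetic loss curve generated from known parameters:
true_loss = lambda k: 1 / (0.05 * k + 0.5) + 0.1
history = [(k, true_loss(k)) for k in range(1, 30)]
steps = predict_steps_to_converge(history, target_loss=0.3)
```

On the synthetic curve above, the analytic answer is k = 90 steps, and the grid search recovers it because the true parameters lie on the grid; a production scheduler would instead refit continuously as new loss samples arrive (the paper mentions non-negative least squares for this purpose).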
Pages: 14
Related Papers
50 records in total
  • [31] Optimus: Towards Optimal Layer-Fusion on Deep Learning Processors
    Cai, Xuyi
    Wang, Ying
    Zhang, Lei
    LCTES '21: PROCEEDINGS OF THE 22ND ACM SIGPLAN/SIGBED INTERNATIONAL CONFERENCE ON LANGUAGES, COMPILERS, AND TOOLS FOR EMBEDDED SYSTEMS, 2021, : 67 - 79
  • [32] Efficient Low-Complexity Scheduler for Wireless Resource Virtualization
    Kalil, M.
    Moubayed, A.
    Shami, A.
    Al-Dweik, A.
    IEEE WIRELESS COMMUNICATIONS LETTERS, 2016, 5 (01) : 56 - 59
  • [33] Autonomous Learning for Efficient Resource Utilization of Dynamic VM Migration
    Choi, Hyung Won
    Kwak, Hukeun
    Sohn, Andrew
    Chung, Kyusik
    ICS'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2008, : 185 - +
  • [34] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
    Zhang, Hao
    Zheng, Zeyu
    Xu, Shizhen
    Dai, Wei
    Ho, Qirong
    Liang, Xiaodan
    Hu, Zhiting
    Wei, Jinliang
    Xie, Pengtao
    Xing, Eric P.
    2017 USENIX ANNUAL TECHNICAL CONFERENCE (USENIX ATC '17), 2017, : 181 - 193
  • [35] DeepGANTT: A Scalable Deep Learning Scheduler for Backscatter Networks
    Perez-Ramirez, Daniel F.
    Perez-Penichet, Carlos
    Tsiftes, Nicolas
    Voigt, Thiemo
    Kostic, Dejan
    Boman, Magnus
    PROCEEDINGS OF THE 2023 THE 22ND INTERNATIONAL CONFERENCE ON INFORMATION PROCESSING IN SENSOR NETWORKS, IPSN 2023, 2023, : 163 - 176
  • [36] Efficient Deep Structure Learning for Resource-Limited IoT Devices
    Shen, Shibo
    Li, Rongpeng
    Zhao, Zhifeng
    Liu, Qing
    Liang, Jing
    Zhang, Honggang
    2020 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2020,
  • [37] Resource-Efficient Deep Learning: Fast Hand Gestures on Microcontrollers
    Mach, Tuan Kiet Tran
    Van, Khai Nguyen
    Le, Minhhuy
    EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 2024, 11 (03) : 1 - 11
  • [38] GADaM: Generic Adaptive Deep-learning-based Multipath Scheduler Selector for Dynamic Heterogeneous Environment
    Chu, Tran-Tuan
    Labiod, Mohamed Aymen
    Tran, Hai-Anh
    Mellouk, Abdelhamid
    IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2022), 2022, : 4908 - 4913
  • [39] Traffic Prediction-Enabled Energy-Efficient Dynamic Computing Resource Allocation in CRAN Based on Deep Learning
    Fu, Yongqin
    Wang, Xianbin
    IEEE OPEN JOURNAL OF THE COMMUNICATIONS SOCIETY, 2022, 3 : 159 - 175
  • [40] HUNHODRL: Energy efficient resource distribution in a cloud environment using hybrid optimized deep reinforcement model with HunterPlus scheduler
    Chellamuthu, Senthilkumar
    Ramanathan, Kalaivani
    Arivanandhan, Rajesh
    NETWORK-COMPUTATION IN NEURAL SYSTEMS, 2025,