Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Cited by: 253

Authors:

Peng, Yanghua [1]

Bao, Yixin [1]

Chen, Yangrui [1]

Wu, Chuan [1]

Guo, Chuanxiong [2]

Affiliations:

[1] Univ Hong Kong, Hong Kong, Peoples R China

[2] Bytedance Inc, Beijing, Peoples R China

Source:

EUROSYS '18: PROCEEDINGS OF THE THIRTEENTH EUROSYS CONFERENCE | 2018

Keywords:

Resource management; deep learning

DOI:

10.1145/3190508.3190517

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is the key to the maximal performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically specify a fixed amount of resources for each job, which prohibits high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job. Based on the models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
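
The abstract describes allocating resources dynamically from per-job performance models but does not give the procedure. The following is only a minimal sketch of one way such an allocator could look, not the paper's algorithm: it assumes identical resource units, an estimate of each job's remaining work, and a speed model supplied by the caller. The names greedy_allocate and toy_speed, and the 10% per-unit overhead in toy_speed, are hypothetical and chosen for illustration.

```python
from typing import Callable, Dict


def toy_speed(job: str, units: int) -> float:
    """Hypothetical speed model with diminishing returns (not from the paper)."""
    return units / (1.0 + 0.1 * (units - 1))


def greedy_allocate(
    remaining_work: Dict[str, float],
    speed: Callable[[str, int], float],
    capacity: int,
) -> Dict[str, int]:
    """Hand out `capacity` identical resource units greedily.

    Every job starts with one unit. Each further unit goes to the job whose
    estimated remaining time (remaining work / speed) drops the most.
    """
    alloc = {job: 1 for job in remaining_work}
    if len(alloc) > capacity:
        raise ValueError("not enough capacity to give every job one unit")

    def est_time(job: str, units: int) -> float:
        return remaining_work[job] / speed(job, units)

    for _ in range(capacity - len(alloc)):
        best_job, best_gain = None, 0.0
        for job, units in alloc.items():
            gain = est_time(job, units) - est_time(job, units + 1)
            if gain > best_gain:
                best_job, best_gain = job, gain
        if best_job is None:
            break  # no job benefits from another unit
        alloc[best_job] += 1
    return alloc


if __name__ == "__main__":
    # Assumed inputs: job "a" has ten times the remaining work of job "b".
    print(greedy_allocate({"a": 100.0, "b": 10.0}, toy_speed, capacity=4))
```

In this sketch the long-running job receives the spare units first, because each extra unit shortens its estimated remaining time more than it would for the short job. A real scheduler would refit the work and speed estimates as training progresses and rerun the allocation periodically.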


Pages: 14
