Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters

Cited by: 0

Authors:

Mohan, Jayashree [1]

Phanishayee, Amar [1]

Kulkarni, Janardhan [1]

Chidambaram, Vijay [2, 3]

Affiliations:

[1] Microsoft Res, Redmond, WA 98052 USA

[2] Univ Texas Austin, Austin, TX USA

[3] VMware Res, Palo Alto, CA USA

Source:

PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, OSDI 2022 | 2022

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider the GPU as the dominant resource and allocate other resources such as CPU and memory in proportion to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider the impact of a job's sensitivity to the allocation of CPU and memory resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation, and some jobs might not be affected by less than the GPU-proportional allocation. Synergy performs such multi-resource workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average job completion time by up to 3.4x, by better utilizing existing cluster resources, compared to traditional GPU-proportional scheduling.


Pages: 579-596

Page count: 18
