Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters

Cited by: 0

Authors:

Mohan, Jayashree [1]

Phanishayee, Amar [1]

Kulkarni, Janardhan [1]

Chidambaram, Vijay [2,3]

Affiliations:

[1] Microsoft Res, Redmond, WA 98052 USA

[2] Univ Texas Austin, Austin, TX USA

[3] VMware Res, Palo Alto, CA USA

Source:

PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, OSDI 2022 | 2022

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training treat the GPU as the dominant resource and allocate other resources such as CPU and memory in proportion to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider a job's sensitivity to the allocation of CPU and memory resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation, and some jobs might not be affected by less than the GPU-proportional allocation. Synergy performs such multi-resource workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average job completion time by up to 3.4x, by better utilizing existing cluster resources, compared to traditional GPU-proportional scheduling.
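
The GPU-proportional baseline that the abstract contrasts against can be illustrated with a minimal Python sketch. It is not Synergy's algorithm, and the names `Server` and `gpu_proportional_share` as well as the server sizes are hypothetical, chosen only for illustration. The sketch assumes a job receives the same fraction of a server's CPUs and memory as the fraction of that server's GPUs it requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical server description (not taken from the paper)."""
    gpus: int
    cpus: int
    mem_gb: float


def gpu_proportional_share(server: Server, gpus_requested: int) -> tuple[float, float]:
    """Return (cpus, mem_gb) given to a job under GPU-proportional allocation.

    The job's CPU and memory share equals its share of the server's GPUs,
    regardless of how sensitive the job actually is to CPU or memory.
    """
    if not 0 < gpus_requested <= server.gpus:
        raise ValueError("gpus_requested must be between 1 and server.gpus")
    fraction = gpus_requested / server.gpus
    return server.cpus * fraction, server.mem_gb * fraction


# Illustrative numbers only: an 8-GPU server with 24 CPUs and 500 GB of memory.
example_server = Server(gpus=8, cpus=24, mem_gb=500.0)
cpus, mem_gb = gpu_proportional_share(example_server, gpus_requested=2)
print(cpus, mem_gb)
```

A workload-aware scheduler of the kind the abstract describes would instead move some of this CPU and memory between jobs, giving more to jobs that benefit and less to jobs that are unaffected.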


Pages: 579-596

Page count: 18
