Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters

Cited by: 0
Authors
Mohan, Jayashree [1]
Phanishayee, Amar [1]
Kulkarni, Janardhan [1]
Chidambaram, Vijay [2, 3]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Univ Texas Austin, Austin, TX USA
[3] VMware Res, Palo Alto, CA USA
Source
PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, OSDI 2022 | 2022
Keywords
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training treat the GPU as the dominant resource and allocate other resources, such as CPU and memory, in proportion to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider how sensitive a job is to its allocation of CPU and memory resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation, and some jobs might not be affected by less than the GPU-proportional allocation. Synergy performs such multi-resource, workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average job completion time by up to 3.4x compared to traditional GPU-proportional scheduling, by better utilizing existing cluster resources.
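The GPU-proportional baseline that the abstract contrasts against can be illustrated with a short arithmetic sketch. This is not Synergy's algorithm and is not taken from the paper; the `Server` type, the `gpu_proportional_share` function, and the server dimensions (8 GPUs, 32 CPUs, 512 GB) are hypothetical, chosen only to show what "proportional to the number of GPUs requested" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical server shape; the numbers used below are illustrative only."""
    gpus: int
    cpus: int
    mem_gb: float


def gpu_proportional_share(server: Server, gpus_requested: int) -> tuple[float, float]:
    """Return the (CPU, memory in GB) share under GPU-proportional allocation.

    A job that requests a fraction f of the server's GPUs receives the same
    fraction f of its CPUs and memory, regardless of what the job needs.
    """
    if not 0 < gpus_requested <= server.gpus:
        raise ValueError("gpus_requested must be between 1 and server.gpus")
    fraction = gpus_requested / server.gpus
    return fraction * server.cpus, fraction * server.mem_gb


# Assumed example: a job asking for 2 of 8 GPUs gets a quarter of CPU and memory.
server = Server(gpus=8, cpus=32, mem_gb=512.0)
cpus, mem_gb = gpu_proportional_share(server, 2)
print(cpus, mem_gb)  # 8.0 128.0
```

A resource-sensitive scheduler of the kind the abstract describes would deviate from these shares, giving some jobs more and others less than this baseline, based on profiled sensitivity.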
Pages: 579-596
Page count: 18
Related papers
50 in total
  • [11] DeepPlace: Learning to Place Applications in Multi-Tenant Clusters
    Mitra, Subrata
    Mondal, Shanka Subhra
    Sheoran, Nikhil
    Dhake, Neeraj
    Nehra, Ravinder
    Simha, Ramanuja
    APSYS'19: PROCEEDINGS OF THE 10TH ACM SIGOPS ASIA-PACIFIC WORKSHOP ON SYSTEMS, 2019, : 61 - 68
  • [12] Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
    Luo, Yizhou
    Wang, Qiang
    Shi, Shaohuai
    Lai, Jiaxin
    Qi, Shuhan
    Zhang, Jiajia
    Wang, Xuan
    2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024,
  • [13] Workflow Scheduling in Multi-Tenant Cloud Computing Environments
    Rimal, Bhaskar Prasad
    Maier, Martin
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (01) : 290 - 304
  • [14] Elastic Deep Learning in Multi-Tenant GPU Clusters
    Wu, Yidi
    Ma, Kaihao
    Yan, Xiao
    Liu, Zhi
    Cai, Zhenkun
    Huang, Yuzhen
    Cheng, James
    Yuan, Han
    Yu, Fan
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (01) : 144 - 158
  • [15] A Multi-Tenant Level Lightweight Lock Mechanism for Multi-Tenant Database
    Kang, Tao
    Zhang, Shidong
    Kong, Lanju
    2014 11th Web Information System and Application Conference (WISA), 2014, : 3 - 7
  • [16] On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
    Yu, Menglu
    Ji, Bo
    Rajan, Hridesh
    Liu, Jia
    PROCEEDINGS OF THE 2022 THE TWENTY-THIRD INTERNATIONAL SYMPOSIUM ON THEORY, ALGORITHMIC FOUNDATIONS, AND PROTOCOL DESIGN FOR MOBILE NETWORKS AND MOBILE COMPUTING, MOBIHOC 2022, 2022, : 21 - 30
  • [17] Scheduling multi-tenant cloud workflow tasks with resource reliability
    Xiaoping LI
    Dongyuan PAN
    Yadi WANG
    Rubén RUIZ
    Science China (Information Sciences), 2022, 65 (09) : 127 - 144
  • [18] Towards Fair and Firm Real-Time Scheduling in DNN Multi-Tenant Multi-Accelerator Systems via Reinforcement Learning
    Russo, Enrico
    Blanco, Francesco Giulio
    Palesi, Maurizio
    Ascia, Giuseppe
    Patti, Davide
    Catania, Vincenzo
    2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [19] Scheduling multi-tenant cloud workflow tasks with resource reliability
    Xiaoping Li
    Dongyuan Pan
    Yadi Wang
    Rubén Ruiz
    Science China Information Sciences, 2022, 65
  • [20] Adaptive task scheduling method in multi-tenant cloud computing
    Ramegowda A.
    Agarkhed J.
    Patil S.R.
    International Journal of Information Technology, 2020, 12 (4) : 1093 - 1102