Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters

Cited by: 0
Authors
Mohan, Jayashree [1]
Phanishayee, Amar [1]
Kulkarni, Janardhan [1]
Chidambaram, Vijay [2, 3]
Affiliations
[1] Microsoft Research, Redmond, WA 98052, USA
[2] University of Texas at Austin, Austin, TX, USA
[3] VMware Research, Palo Alto, CA, USA
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Training Deep Neural Networks (DNNs) is a popular workload in both enterprise and cloud data centers. Existing schedulers for DNN training treat the GPU as the dominant resource and allocate other resources, such as CPU and memory, in proportion to the number of GPUs a job requests. Unfortunately, these schedulers do not consider how sensitive a job is to its CPU and memory allocation. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling: some jobs might benefit from more than the GPU-proportional allocation, and some jobs might not be affected by less than the GPU-proportional allocation. Synergy performs such multi-resource, workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average job completion time by up to 3.4x compared to traditional GPU-proportional scheduling, by better utilizing existing cluster resources.
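The contrast the abstract draws between GPU-proportional and workload-aware allocation can be illustrated with a small Python sketch. This is not Synergy's algorithm, which the abstract does not specify. It is a simple greedy heuristic written for illustration only. The function names, the job names, and the per-job demand figures are all hypothetical, and the demands stand in for what a profiler might report.

```python
def gpu_proportional(total, gpus_per_job, total_gpus):
    """Baseline: each job gets the same fraction of a resource as its GPU fraction."""
    return {job: total * g / total_gpus for job, g in gpus_per_job.items()}


def workload_aware(fair, demand):
    """Illustrative heuristic (an assumption, not the paper's method).

    Jobs that need less than their GPU-proportional share keep only what they
    need; the freed amount is handed to jobs that need more, up to their demand.
    """
    # Cap every job at its demand, so insensitive jobs release their surplus.
    alloc = {job: min(fair[job], demand[job]) for job in fair}
    spare = sum(fair.values()) - sum(alloc.values())
    # Hand out the surplus in a fixed (alphabetical) order for determinism.
    for job in sorted(fair):
        extra = min(spare, demand[job] - alloc[job])
        alloc[job] += extra
        spare -= extra
    return alloc


# Hypothetical server: 8 GPUs and 24 CPU cores shared by two 4-GPU jobs.
gpus = {"A": 4, "B": 4}
cpu_demand = {"A": 6.0, "B": 20.0}  # made-up profiled CPU needs

fair_cpus = gpu_proportional(24.0, gpus, total_gpus=8)
tuned_cpus = workload_aware(fair_cpus, cpu_demand)
print(fair_cpus)
print(tuned_cpus)
```

In this made-up case the baseline gives each job 12 cores. The heuristic trims job A to the 6 cores it needs and raises job B to 18, so A loses nothing while B moves closer to its demand of 20.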
Pages: 579-596
Page count: 18