Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters

Authors
Mohan, Jayashree [1]
Phanishayee, Amar [1]
Kulkarni, Janardhan [1]
Chidambaram, Vijay [2, 3]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Univ Texas Austin, Austin, TX USA
[3] VMware Res, Palo Alto, CA USA
Source
PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, OSDI 2022 | 2022
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training treat the GPU as the dominant resource and allocate other resources, such as CPU and memory, in proportion to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider a job's sensitivity to its allocation of CPU and memory resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation, and some jobs might not be affected by less than the GPU-proportional allocation. Synergy performs such multi-resource, workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average job completion time by up to 3.4x compared to traditional GPU-proportional scheduling, by better utilizing existing cluster resources.
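To make the contrast in the abstract concrete, the following is a minimal Python sketch of GPU-proportional allocation next to a toy workload-aware redistribution of CPUs. It is not Synergy's algorithm or its profiling method: the `Server` type, the function names, the per-job CPU demand figures, and the greedy hand-off of surplus cores are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical server description (not taken from the paper)."""
    gpus: int
    cpus: int
    mem_gb: float


def gpu_proportional_share(server, gpus_requested):
    """CPU and memory given to a job in proportion to its share of the GPUs."""
    if not 0 < gpus_requested <= server.gpus:
        raise ValueError("gpus_requested must be between 1 and server.gpus")
    fraction = gpus_requested / server.gpus
    return server.cpus * fraction, server.mem_gb * fraction


def toy_cpu_reallocation(server, jobs):
    """Toy workload-aware CPU split (illustrative assumption, not Synergy).

    jobs maps a job name to (gpus_requested, cpu_demand), where cpu_demand
    stands in for a profiled estimate of how many CPUs the job can use.
    Jobs that need less than their proportional share release the surplus,
    which is then handed greedily to jobs that can use more.
    """
    alloc = {}
    spare = 0.0
    for name, (gpus, demand) in jobs.items():
        share, _ = gpu_proportional_share(server, gpus)
        alloc[name] = min(share, demand)   # never exceed what the job can use
        spare += share - alloc[name]       # surplus released by this job
    for name, (_, demand) in jobs.items():
        extra = min(spare, demand - alloc[name])
        if extra > 0:
            alloc[name] += extra
            spare -= extra
    return alloc


server = Server(gpus=8, cpus=24, mem_gb=500.0)
jobs = {"cpu_light": (4, 6), "cpu_heavy": (4, 20)}
print(gpu_proportional_share(server, 4))
print(toy_cpu_reallocation(server, jobs))
```

In this made-up example both jobs request 4 of 8 GPUs, so GPU-proportional allocation gives each 12 CPUs. The toy redistribution instead leaves the light job with the 6 CPUs it can use and passes the freed 6 to the heavy job, which ends up with 18.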
Pages: 579-596
Page count: 18
Related papers
50 in total
  • [21] Scheduling multi-tenant cloud workflow tasks with resource reliability
    Li, Xiaoping
    Pan, Dongyuan
    Wang, Yadi
    Ruiz, Ruben
    SCIENCE CHINA-INFORMATION SCIENCES, 2022, 65 (09)
  • [22] Efficient network isolation and load balancing in multi-tenant HPC clusters
    Zahid, Feroz
    Gran, Ernst Gunnar
    Bogdanski, Bartosz
    Johnsen, Bjorn Dag
    Skeie, Tor
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2017, 72 : 145 - 162
  • [23] Scheduling dynamic workloads in multi-tenant scientific workflow as a service platforms
    Rodriguez, Maria A.
    Buyya, Rajkumar
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2018, 79 : 739 - 750
  • [24] Acentric Scheduling Strategy for SLA-Based Multi-Tenant Queries
    Zou, Lida
    Li, Qingzhong
    Li, Wenhao
    Kong, Lanju
    INTERNATIONAL JOURNAL OF COOPERATIVE INFORMATION SYSTEMS, 2016, 25 (03)
  • [25] A Power-aware Scheduling Algorithm in Multi-tenant IaaS Clouds
    Liang, Bin
    Dong, Xiaoshe
    Zhang, Xingjun
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING (ICSIP 2019), 2019, : 250 - 253
  • [26] Scheduling Multi-tenant Cloud Workloads on Accelerator-based Systems
    Sengupta, Dipanjan
    Goswami, Anshuman
    Schwan, Karsten
    Pallavi, Krishna
    SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2014, : 513 - 524
  • [27] Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters
    Ye, Zhisheng
    Sun, Peng
    Gao, Wei
    Zhang, Tianwei
    Wang, Xiaolin
    Yan, Shengen
    Luo, Yingwei
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (11) : 2781 - 2793
  • [28] Network-Aware Container Scheduling in Multi-Tenant Data Center
    Rodrigues, Leonardo R.
    Pasin, Marcelo
    Alves, Omir C., Jr.
    Miers, Charles C.
    Pillon, Mauricio A.
    Felber, Pascal
    Koslovski, Guilherme P.
    2019 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2019,
  • [29] A SLA-based Scheduling Approach for Multi-tenant Cloud Simulation
    Peng, Gongzhuang
    Zhao, Jiaxin
    Li, Minghui
    Hou, Baocun
    Zhang, Heming
    PROCEEDINGS OF THE 2015 IEEE 19TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2015, : 600 - 605
  • [30] ReLAQS: Reducing Latency for Multi-Tenant Approximate Queries via Scheduling
    Stafman, Logan
    Or, Andrew
    Freedman, Michael J.
    MIDDLEWARE'19: PROCEEDINGS OF THE 2019 MIDDLEWARE'19: 20TH INTERNATIONAL MIDDLEWARE CONFERENCE, 2019, : 280 - 292