iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

Cited by: 16
Authors
Xu, Fei [1 ]
Xu, Jianian [1 ]
Chen, Jiabin [1 ]
Chen, Li [2 ]
Shang, Ruitao [1 ]
Zhou, Zhi [3 ]
Liu, Fangming [4 ,5 ]
Affiliations
[1] East China Normal Univ, Sch Comp Sci & Technol, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200062, Peoples R China
[2] Univ Louisiana, Sch Comp & Informat, Lafayette, LA 70504 USA
[3] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Key Lab Big Data Anal & Proc, Guangzhou 510006, Guangdong, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518066, Guangdong, Peoples R China
[5] Huazhong Univ Sci & Technol, Wuhan 430074, Hubei, Peoples R China
Funding
U.S. National Science Foundation; National Natural Science Foundation of China;
Keywords
Graphics processing units; Interference; Resource management; Kernel; Performance evaluation; Delays; Adaptation models; Cloud-based DNN inference; predictable performance; GPU resource provisioning; performance interference;
DOI
10.1109/TPDS.2022.3232715
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as demonstrated by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages the system and workload metrics that are practically accessible to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% in comparison to the state-of-the-art GPU resource provisioning strategies.
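The abstract describes jointly choosing a GPU resource allocation and a batch size so that each workload's predicted latency, including an interference term, stays within its SLO. The toy sketch below illustrates that idea only in spirit: the latency model, the linear interference penalty, all constants, and the names (`predicted_latency`, `cheapest_plan`, `contention`) are illustrative assumptions, not the paper's actual performance model or algorithm.

```python
# Illustrative sketch of interference-aware GPU provisioning: pick the
# smallest (GPU share, batch size) pair whose predicted latency meets the
# workload's SLO under an assumed additive interference penalty.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    slo_ms: float   # latency service-level objective (ms)
    base_ms: float  # assumed solo latency at batch size 1 on a full GPU (ms)

def predicted_latency(w: Workload, share: float, batch: int,
                      contention: float) -> float:
    """Toy model: solo latency grows with batch size and shrinks with the
    GPU share, plus an assumed linear penalty for co-location contention."""
    solo = w.base_ms * batch / share
    interference = 0.2 * w.base_ms * contention  # assumed penalty factor
    return solo + interference

def cheapest_plan(w: Workload, contention: float):
    """Scan GPU shares in 10% steps (cheapest first) and batch sizes
    (largest first, to amortize cost); return the first feasible pair."""
    for share in (s / 10 for s in range(1, 11)):  # 0.1 .. 1.0
        for batch in (8, 4, 2, 1):
            if predicted_latency(w, share, batch, contention) <= w.slo_ms:
                return share, batch
    return None  # infeasible even with a full GPU

print(cheapest_plan(Workload("resnet50", slo_ms=40.0, base_ms=3.0),
                    contention=1.0))
```

A real system would calibrate the latency model from measured system and workload metrics rather than hard-coded constants, which is what makes the provisioning proactive instead of reactive.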
Pages: 812 - 827
Page count: 16
Related Papers
50 records
  • [31] HARMONY: Dynamic Heterogeneity-Aware Resource Provisioning in the Cloud
    Zhang, Qi
    Zhani, Mohamed Faten
    Boutaba, Raouf
    Hellerstein, Joseph L.
    2013 IEEE 33RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS), 2013, : 510 - 519
  • [32] Interference-Aware Resource Scheduling in LTE HetNets with Carrier Aggregation Support
    Limani, Zana
    Chiasserini, Carla-Fabiana
    Dell'Aera, Gian Michele
    2015 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2015, : 3137 - 3142
  • [33] Interference-Aware Radio Resource Management for Local Area Wireless Networks
    Pekka Jänis
    Visa Koivunen
    Cássio B. Ribeiro
    EURASIP Journal on Wireless Communications and Networking, 2011
  • [34] Interference-aware picocells coordination resource allocation in heterogeneous OFDMA networks
    Zhao, Jun
    Lu, Zhaoming
    Wen, Xiangming
    Zheng, Wei
    Zhang, Zhicai
    Jing, Wenpeng
    Journal of Information and Computational Science, 2015, 12 (02): 475 - 484
  • [35] Interference-Aware Radio Resource Management for Local Area Wireless Networks
    Janis, Pekka
    Koivunen, Visa
    Ribeiro, Cassio B.
    EURASIP JOURNAL ON WIRELESS COMMUNICATIONS AND NETWORKING, 2011,
  • [36] Performance Efficient Layer-aware DNN Inference Task Scheduling in GPU Cluster
    Geng, Hongmin
    Zeng, Deze
    Li, Yuepeng
    2022 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM 2022), 2022, : 2242 - 2247
  • [37] Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU
    Yu, Fuxun
    Bray, Shawn
    Wang, Di
    Shangguan, Longfei
    Tang, Xulong
    Liu, Chenchen
    Chen, Xiang
    2021 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN (ICCAD), 2021,
  • [38] Interference-aware D2D-Multicast Session Provisioning in LTE-A Networks
    Bhardwaj, Ajay
    Agnihotri, Samar
    2017 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE (WCNC), 2017,
  • [39] An Interference-aware Resource Allocation Scheme for Self-Organizing Heterogeneous Networks
    Du, Hongyan
    Tian, Lin
    Liu, Ling
    Hou, Zhanwei
    2015 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE WORKSHOPS (WCNCW), 2015, : 265 - 270
  • [40] Interference-aware Component Scheduling for Reducing Tail Latency in Cloud Interactive Services
    Han, Rui
    Wang, Junwei
    Huang, Siguang
    Shao, Chenrong
    Zhan, Shulin
    Zhan, Jianfeng
    Luis Vazquez-Poletti, Jose
    2015 IEEE 35TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, 2015, : 744 - 745