iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

Cited by: 16
Authors
Xu, Fei [1 ]
Xu, Jianian [1 ]
Chen, Jiabin [1 ]
Chen, Li [2 ]
Shang, Ruitao [1 ]
Zhou, Zhi [3 ]
Liu, Fangming [4 ,5 ]
Affiliations
[1] East China Normal Univ, Sch Comp Sci & Technol, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200062, Peoples R China
[2] Univ Louisiana, Sch Comp & Informat, Lafayette, LA 70504 USA
[3] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Key Lab Big Data Anal & Proc, Guangzhou 510006, Guangdong, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518066, Guangdong, Peoples R China
[5] Huazhong Univ Sci & Technol, Wuhan 430074, Hubei, Peoples R China
Funding
US National Science Foundation; National Natural Science Foundation of China;
Keywords
Graphics processing units; Interference; Resource management; Kernel; Performance evaluation; Delays; Adaptation models; Cloud-based DNN inference; predictable performance; GPU resource provisioning; performance interference;
DOI
10.1109/TPDS.2022.3232715
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline Classification Code
081202;
Abstract
GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as evidenced by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages the system and workload metrics that are practically accessible to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% in comparison to state-of-the-art GPU resource provisioning strategies.
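The provisioning idea the abstract describes can be sketched as a joint search over per-workload GPU shares and batch sizes, where predicted latency, inflated by an interference term, must stay within the SLO at minimum cost. The Python sketch below is a minimal illustration under assumed simplifications: the Workload class, the linear latency model, and the interference coefficient are all hypothetical placeholders, not iGniter's actual performance model.

# A minimal, illustrative sketch of the joint optimization the abstract
# describes. The Workload fields, the linear latency model, and the
# interference term are hypothetical simplifications, not iGniter's model.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    slo_ms: float        # latency service level objective (ms)
    arrival_rate: float  # request arrival rate (requests/s)

def predict_latency_ms(w: Workload, gpu_share: float, batch: int,
                       co_located_load: float) -> float:
    """Toy latency model: compute time grows with batch size and shrinks
    with the GPU share; interference adds a penalty proportional to the
    aggregate load of co-located workloads (all coefficients are made up)."""
    compute = 2.0 * batch / gpu_share                # hypothetical kernel time
    fill_time = 1000.0 * batch / w.arrival_rate      # ms to accumulate a batch
    interference = 5.0 * co_located_load             # hypothetical slowdown
    return compute + fill_time / 2.0 + interference  # avg wait ~ half fill time

def provision(w: Workload, co_located_load: float):
    """Return the cheapest feasible (gpu_share, batch) pair, using the GPU
    share as a proxy for monetary cost, or None if the SLO is unreachable."""
    best = None
    for batch in (1, 2, 4, 8, 16):
        for share in (s / 10 for s in range(1, 11)):  # 10% .. 100% of a GPU
            latency = predict_latency_ms(w, share, batch, co_located_load)
            if latency <= w.slo_ms and (best is None or share < best[0]):
                best = (share, batch)
    return best

# Example: a ResNet-50-like workload with a 40 ms SLO at 400 req/s,
# co-located with a moderate aggregate interference load of 0.5.
print(provision(Workload("resnet50", slo_ms=40.0, arrival_rate=400.0), 0.5))

In the paper's framework, the latency predictor is instead a lightweight analytic model fitted to practically accessible system and workload metrics, and the chosen allocations are realized on the NVIDIA Triton inference server rather than by this toy grid search.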
Pages: 812 - 827
Page count: 16
Related Papers
50 records in total
  • [1] Heterogeneity and Interference-Aware Virtual Machine Provisioning for Predictable Performance in the Cloud
    Xu, Fei
    Liu, Fangming
    Jin, Hai
    IEEE TRANSACTIONS ON COMPUTERS, 2016, 65 (08) : 2470 - 2483
  • [2] On Interference-aware Provisioning for Cloud-based Big Data Processing
    Yuan, Yi
    Wang, Haiyang
    Wang, Dan
    Liu, Jiangchuan
    2013 IEEE/ACM 21ST INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2013, : 201 - 206
  • [3] Qora: Neural-Enhanced Interference-Aware Resource Provisioning for Serverless Computing
    Ma, Ruifeng
    Zhan, Yufeng
    Wu, Chuge
    Hong, Zicong
    Ali, Yasir
    Xia, Yuanqing
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2025,
  • [4] Interference-Aware Scheduling for Inference Serving
    Mendoza, Daniel
    Romero, Francisco
    Li, Qian
    Yadwadkar, Neeraja J.
    Kozyrakis, Christos
    PROCEEDINGS OF THE 1ST WORKSHOP ON MACHINE LEARNING AND SYSTEMS (EUROMLSYS'21), 2021, : 80 - 88
  • [5] Intelligent, Performance Interference-aware Resource Management for IoT Cloud Backends
    Caglar, Faruk
    Shekhar, Shashank
    Gokhale, Aniruddha
    Koutsoukos, Xenofon
    PROCEEDINGS 2016 IEEE FIRST INTERNATIONAL CONFERENCE ON INTERNET-OF-THINGS DESIGN AND IMPLEMENTATION IOTDI 2016, 2016, : 95 - 105
  • [6] Interference-aware VM Placement in Cloud
    Hossain, Sajjad
    Rahman, Md. Mahfuzur
    Anwar, Md Musfique
    2023 IEEE 43RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS, 2023, : 1045 - 1046
  • [7] Jily: Cost-Aware AutoScaling of Heterogeneous GPU for DNN Inference in Public Cloud
    Wang, Zhaoxing
    Tang, Xuehai
    Liu, Qiuyang
    Han, Jizhong
    2019 IEEE 38TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC), 2019,
  • [8] Exploiting Interference-aware GPU Container Concurrency Learning from Resource Usage of Application Execution
    Kim, Sejin
    Kim, Yoonhee
    APNOMS 2020: 2020 21ST ASIA-PACIFIC NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM (APNOMS), 2020, : 173 - 178
  • [9] ML-driven classification scheme for dynamic interference-aware resource in cloud infrastructures
    Meyer, Vinicius
    Kirchoff, Dionatra F.
    Da Silva, Matheus L.
    De Rose, Cesar A. F.
    JOURNAL OF SYSTEMS ARCHITECTURE, 2021, 116
  • [10] Performance, Resource, and Cost Aware Resource Provisioning in the Cloud
    Logeswaran, Lajanugen
    Bandara, H. M. N. Dilum
    Bhathiya, H. S.
    PROCEEDINGS OF 2016 IEEE 9TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD), 2016, : 913 - 916