iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

Cited by: 16
Authors
Xu, Fei [1 ]
Xu, Jianian [1 ]
Chen, Jiabin [1 ]
Chen, Li [2 ]
Shang, Ruitao [1 ]
Zhou, Zhi [3 ]
Liu, Fangming [4 ,5 ]
Affiliations
[1] East China Normal Univ, Sch Comp Sci & Technol, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200062, Peoples R China
[2] Univ Louisiana, Sch Comp & Informat, Lafayette, LA 70504 USA
[3] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Key Lab Big Data Anal & Proc, Guangzhou 510006, Guangdong, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518066, Guangdong, Peoples R China
[5] Huazhong Univ Sci & Technol, Wuhan 430074, Hubei, Peoples R China
Funding
US National Science Foundation; National Natural Science Foundation of China;
Keywords
Graphics processing units; Interference; Resource management; Kernel; Performance evaluation; Delays; Adaptation models; Cloud-based DNN inference; predictable performance; GPU resource provisioning; performance interference;
DOI
10.1109/TPDS.2022.3232715
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline Classification Code
081202;
Abstract
GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as evidenced by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages the system and workload metrics that are practically accessible to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% in comparison to state-of-the-art GPU resource provisioning strategies.
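The provisioning idea the abstract describes can be sketched as a joint search over per-workload GPU shares and batch sizes, where predicted latency, inflated by an interference term, must stay within the SLO at minimum cost. The Python sketch below is a minimal illustration under assumed simplifications: the Workload class, the linear latency model, and the interference coefficient are all hypothetical placeholders, not iGniter's actual performance model.

# A minimal, illustrative sketch of the joint optimization the abstract
# describes. The Workload fields, the linear latency model, and the
# interference term are hypothetical simplifications, not iGniter's model.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    slo_ms: float        # latency service level objective (ms)
    arrival_rate: float  # request arrival rate (requests/s)

def predict_latency_ms(w: Workload, gpu_share: float, batch: int,
                       co_located_load: float) -> float:
    """Toy latency model: compute time grows with batch size and shrinks
    with the GPU share; interference adds a penalty proportional to the
    aggregate load of co-located workloads (all coefficients are made up)."""
    compute = 2.0 * batch / gpu_share                # hypothetical kernel time
    fill_time = 1000.0 * batch / w.arrival_rate      # ms to accumulate a batch
    interference = 5.0 * co_located_load             # hypothetical slowdown
    return compute + fill_time / 2.0 + interference  # avg wait ~ half fill time

def provision(w: Workload, co_located_load: float):
    """Return the cheapest feasible (gpu_share, batch) pair, using the GPU
    share as a proxy for monetary cost, or None if the SLO is unreachable."""
    best = None
    for batch in (1, 2, 4, 8, 16):
        for share in (s / 10 for s in range(1, 11)):  # 10% .. 100% of a GPU
            latency = predict_latency_ms(w, share, batch, co_located_load)
            if latency <= w.slo_ms and (best is None or share < best[0]):
                best = (share, batch)
    return best

# Example: a ResNet-50-like workload with a 40 ms SLO at 400 req/s,
# co-located with a moderate aggregate interference load of 0.5.
print(provision(Workload("resnet50", slo_ms=40.0, arrival_rate=400.0), 0.5))

In the paper's framework, the latency predictor is instead a lightweight analytic model fitted to practically accessible system and workload metrics, and the chosen allocations are realized on the NVIDIA Triton inference server rather than by this toy grid search.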
Pages: 812 - 827
Page count: 16
Related Papers
50 records in total
  • [1] Heterogeneity and Interference-Aware Virtual Machine Provisioning for Predictable Performance in the Cloud
    Xu, Fei
    Liu, Fangming
    Jin, Hai
    IEEE TRANSACTIONS ON COMPUTERS, 2016, 65 (08) : 2470 - 2483
  • [2] On Interference-aware Provisioning for Cloud-based Big Data Processing
    Yuan, Yi
    Wang, Haiyang
    Wang, Dan
    Liu, Jiangchuan
    2013 IEEE/ACM 21ST INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2013, : 201 - 206
  • [3] Qora: Neural-Enhanced Interference-Aware Resource Provisioning for Serverless Computing
    Ma, Ruifeng
    Zhan, Yufeng
    Wu, Chuge
    Hong, Zicong
    Ali, Yasir
    Xia, Yuanqing
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2025,
  • [4] Interference-Aware Scheduling for Inference Serving
    Mendoza, Daniel
    Romero, Francisco
    Li, Qian
    Yadwadkar, Neeraja J.
    Kozyrakis, Christos
    PROCEEDINGS OF THE 1ST WORKSHOP ON MACHINE LEARNING AND SYSTEMS (EUROMLSYS'21), 2021, : 80 - 88
  • [5] Intelligent, Performance Interference-aware Resource Management for IoT Cloud Backends
    Caglar, Faruk
    Shekhar, Shashank
    Gokhale, Aniruddha
    Koutsoukos, Xenofon
    PROCEEDINGS 2016 IEEE FIRST INTERNATIONAL CONFERENCE ON INTERNET-OF-THINGS DESIGN AND IMPLEMENTATION IOTDI 2016, 2016, : 95 - 105
  • [6] Interference-aware VM Placement in Cloud
    Hossain, Sajjad
    Rahman, Md. Mahfuzur
    Anwar, Md Musfique
    2023 IEEE 43RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS, 2023, : 1045 - 1046
  • [7] Jily: Cost-Aware AutoScaling of Heterogeneous GPU for DNN Inference in Public Cloud
    Wang, Zhaoxing
    Tang, Xuehai
    Liu, Qiuyang
    Han, Jizhong
    2019 IEEE 38TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC), 2019,
  • [8] Exploiting Interference-aware GPU Container Concurrency Learning from Resource Usage of Application Execution
    Kim, Sejin
    Kim, Yoonhee
    APNOMS 2020: 2020 21ST ASIA-PACIFIC NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM (APNOMS), 2020, : 173 - 178
  • [9] ML-driven classification scheme for dynamic interference-aware resource in cloud infrastructures
    Meyer, Vinicius
    Kirchoff, Dionatra F.
    Da Silva, Matheus L.
    De Rose, Cesar A. F.
    JOURNAL OF SYSTEMS ARCHITECTURE, 2021, 116
  • [10] Performance, Resource, and Cost Aware Resource Provisioning in the Cloud
    Logeswaran, Lajanugen
    Bandara, H. M. N. Dilum
    Bhathiya, H. S.
    PROCEEDINGS OF 2016 IEEE 9TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD), 2016, : 913 - 916