Coordinated Batching and DVFS for DNN Inference on GPU Accelerators

Cited by: 35
Authors
Nabavinejad, Seyed Morteza [1 ]
Reda, Sherief [2 ]
Ebrahimi, Masoumeh [3 ]
Affiliations
[1] Inst Res Fundamental Sci IPM, Sch Comp Sci, Tehran 1953833511, Iran
[2] Brown Univ, Sch Engn, Providence, RI 02912 USA
[3] KTH Royal Inst Technol, S-11428 Stockholm, Sweden
Funding
Swedish Research Council;
Keywords
Throughput; Graphics processing units; Power demand; Runtime; Bayes methods; Resource management; Optimization; Deep neural networks; GPU accelerator; power consumption; throughput; batch size; dynamic voltage frequency scaling;
DOI
10.1109/TPDS.2022.3144614
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline code
081202 ;
Abstract
Employing hardware accelerators to improve the performance and energy efficiency of DNN applications is on the rise. One challenge of using hardware accelerators, including GPU-based ones, is that their performance is limited by internal and external factors, such as power caps. A common approach to meeting a power cap is the Dynamic Voltage Frequency Scaling (DVFS) technique. However, the functionality of this technique is limited and platform-dependent. To tackle this challenge, we propose a new control knob: the size of the input batches fed to the GPU accelerator in DNN inference applications. We first evaluate the impact of batch size on the power consumption and performance of DNN inference. Then, we introduce the design and implementation of a fast and lightweight runtime system, called BatchDVFS. Dynamic batching is implemented in BatchDVFS to adaptively change the batch size and thereby trade off throughput against power consumption. It employs a binary-search-based approach to find a proper batch size within a short period of time. By combining dynamic batching with the DVFS technique, BatchDVFS can control power consumption over wider ranges and hence yield higher throughput in the presence of power caps. To find near-optimal solutions for long-running jobs that can afford a relatively significant profiling overhead, compared with the BatchDVFS overhead, we also design an approach, called BOBD, that employs Bayesian Optimization to wisely explore the vast state space resulting from the combination of batch size and DVFS settings. Through several experiments with a modern GPU and several DNN models and input datasets, we show that BatchDVFS can significantly surpass techniques based solely on DVFS or batching in terms of throughput (up to 11.2x and 2.2x, respectively), while successfully meeting the power cap.
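To make the abstract's core idea concrete, the sketch below illustrates (under stated assumptions) how a binary search over batch size can be coordinated with a small set of DVFS clock levels to satisfy a power cap. The abstract only states that BatchDVFS binary-searches the batch size and combines it with DVFS; the coordination order shown here (batching as the fine-grained knob, clock reduction as a fallback), the function names, and the synthetic power model are illustrative assumptions, not the paper's actual implementation.

```python
def measure_power(batch_size: int, sm_clock_mhz: int) -> float:
    """Placeholder power model in watts (hypothetical). A real runtime would
    instead profile the inference workload at this setting, e.g., by reading
    the GPU's reported power draw while serving a batch of that size."""
    return 60.0 + 0.15 * batch_size + 0.08 * (sm_clock_mhz - 900)


def largest_batch_under_cap(power_cap_w: float, sm_clock_mhz: int,
                            max_batch: int = 1024):
    """Binary search for the largest batch size whose power stays <= power_cap_w,
    assuming power grows monotonically with batch size at a fixed clock."""
    lo, hi, best = 1, max_batch, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure_power(mid, sm_clock_mhz) <= power_cap_w:
            best, lo = mid, mid + 1      # cap met: try a larger batch
        else:
            hi = mid - 1                 # cap exceeded: try a smaller batch
    return best


def pick_operating_point(power_cap_w: float,
                         clocks_mhz=(1530, 1350, 1200, 1050, 900)):
    """Prefer the highest clock and lower it only when no batch size fits under
    the cap, letting batching provide the fine-grained power control."""
    for clock in clocks_mhz:
        batch = largest_batch_under_cap(power_cap_w, clock)
        if batch is not None:
            return clock, batch
    return clocks_mhz[-1], 1             # fall back: lowest clock, batch size 1


if __name__ == "__main__":
    clock, batch = pick_operating_point(power_cap_w=150.0)
    print(f"Selected SM clock {clock} MHz and batch size {batch} under a 150 W cap")
```

Because each probe of the search requires a power measurement, the binary search keeps the number of profiled batch sizes logarithmic in the batch-size range, which is consistent with the abstract's claim of a fast, lightweight runtime; the separate BOBD approach instead spends a larger profiling budget to explore the joint batch-size/DVFS space with Bayesian Optimization.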
Pages: 2496 - 2508
Number of pages: 13