CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system

Cited by: 1
Authors
Zhang, Qi [1 ]
Liu, Yi [1 ]
Liu, Tao [2 ]
Qian, Depei [1 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, 37 Xueyuan Rd, Beijing 100190, Peoples R China
[2] Shandong Prov Key Lab Comp Networks, 28666 Jingshi Dong Lu, Jinan 250103, Shandong, Peoples R China
Source
JOURNAL OF SUPERCOMPUTING, 2023, Vol. 79, Issue 13
Keywords
Deep learning; Inference; Quality of service; Tail latency; GPU
DOI
10.1007/s11227-023-05183-6
CLC number
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
Recent years have witnessed significant achievements in deep learning (DL) technologies, and a growing number of online service operators rely on DL to provide intelligent, personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models serving data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing disrupts request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme that coordinates the network interface card with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy, together with an interference-aware concurrent batch throttling strategy, to keep inference within the deadline. We evaluate CoFB on four DL inference services and compare it to two state-of-the-art inference systems, NVIDIA Triton and DVABatch. Experimental results show that CoFB outperforms these baselines, serving up to 2.69x and 1.96x higher load, respectively, under preset tail latency objectives.
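The abstract describes a request reordering and batching policy that keeps inference within the deadline. The sketch below is a minimal illustration of how such deadline-aware batching can work; it is not the authors' implementation. It assumes an earliest-deadline-first queue and a toy linear latency model, and the names Request, estimate_batch_latency, and form_batch are hypothetical.

# Illustrative sketch of deadline-aware request reordering and batching.
# Assumption: requests carry absolute deadlines, are kept in an
# earliest-deadline-first (EDF) heap, and the batch grows only while the
# estimated batch execution time still meets the tightest admitted deadline.
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    deadline: float                        # absolute deadline (seconds)
    payload: object = field(compare=False)


def estimate_batch_latency(batch_size: int) -> float:
    """Toy latency model: fixed launch overhead plus a per-request cost.
    A real system would profile each model on the target GPU."""
    return 0.002 + 0.0005 * batch_size


def form_batch(pending, max_batch, now):
    """Pop requests in EDF order, growing the batch only while the
    estimated batch latency still meets the tightest deadline."""
    batch = []
    while pending and len(batch) < max_batch:
        candidate = pending[0]             # peek: earliest remaining deadline
        tightest = batch[0].deadline if batch else candidate.deadline
        projected_finish = now + estimate_batch_latency(len(batch) + 1)
        if projected_finish > tightest:
            break                          # a larger batch would violate a deadline
        batch.append(heapq.heappop(pending))
    return batch


if __name__ == "__main__":
    now = time.monotonic()
    pending = []
    for i, slack in enumerate([0.010, 0.0032, 0.050, 0.0045]):
        heapq.heappush(pending, Request(deadline=now + slack, payload=f"req{i}"))
    batch = form_batch(pending, max_batch=32, now=now)
    print("admitted:", [r.payload for r in batch])

In this example the two requests with the tightest deadlines are admitted, and the batch stops growing once adding a third request would push the projected finish time past the tightest deadline, which is the batch throttling behavior the abstract alludes to.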
Pages: 14172-14199 (28 pages)