CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system

Cited by: 1
Authors
Zhang, Qi [1 ]
Liu, Yi [1 ]
Liu, Tao [2 ]
Qian, Depei [1 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, 37 Xueyuan Rd, Beijing 100190, Peoples R China
[2] Shandong Prov Key Lab Comp Networks, 28666 Jingshi Dong Lu, Jinan 250103, Shandong, Peoples R China
Source
JOURNAL OF SUPERCOMPUTING, 2023, Vol. 79, Issue 13
Keywords
Deep learning; Inference; Quality of service; Tail latency; GPU
DOI
10.1007/s11227-023-05183-6
CLC number
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
Recent years have witnessed significant achievements in deep learning (DL) technologies, and a growing number of online service operators rely on DL to provide intelligent, personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models serving data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing disrupts request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme that coordinates the network interface card with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy, together with an interference-aware concurrent batch throttling strategy, to keep inference within the deadline. We evaluate CoFB on four DL inference services and compare it to two state-of-the-art inference systems, NVIDIA Triton and DVABatch. Experimental results show that CoFB outperforms these baselines, serving up to 2.69x and 1.96x higher load, respectively, under preset tail latency objectives.
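The abstract describes a request reordering and batching policy that keeps inference within the deadline. The sketch below is a minimal illustration of how such deadline-aware batching can work; it is not the authors' implementation. It assumes an earliest-deadline-first queue and a toy linear latency model, and the names Request, estimate_batch_latency, and form_batch are hypothetical.

# Illustrative sketch of deadline-aware request reordering and batching.
# Assumption: requests carry absolute deadlines, are kept in an
# earliest-deadline-first (EDF) heap, and the batch grows only while the
# estimated batch execution time still meets the tightest admitted deadline.
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    deadline: float                        # absolute deadline (seconds)
    payload: object = field(compare=False)


def estimate_batch_latency(batch_size: int) -> float:
    """Toy latency model: fixed launch overhead plus a per-request cost.
    A real system would profile each model on the target GPU."""
    return 0.002 + 0.0005 * batch_size


def form_batch(pending, max_batch, now):
    """Pop requests in EDF order, growing the batch only while the
    estimated batch latency still meets the tightest deadline."""
    batch = []
    while pending and len(batch) < max_batch:
        candidate = pending[0]             # peek: earliest remaining deadline
        tightest = batch[0].deadline if batch else candidate.deadline
        projected_finish = now + estimate_batch_latency(len(batch) + 1)
        if projected_finish > tightest:
            break                          # a larger batch would violate a deadline
        batch.append(heapq.heappop(pending))
    return batch


if __name__ == "__main__":
    now = time.monotonic()
    pending = []
    for i, slack in enumerate([0.010, 0.0032, 0.050, 0.0045]):
        heapq.heappush(pending, Request(deadline=now + slack, payload=f"req{i}"))
    batch = form_batch(pending, max_batch=32, now=now)
    print("admitted:", [r.payload for r in batch])

In this example the two requests with the tightest deadlines are admitted, and the batch stops growing once adding a third request would push the projected finish time past the tightest deadline, which is the batch throttling behavior the abstract alludes to.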
Pages: 14172-14199 (28 pages)