BARM: A Batch-Aware Resource Manager for Boosting Multiple Neural Networks Inference on GPUs With Memory Oversubscription

Cited by: 1
Authors
Qiu, Zhao-Wei [1]
Liu, Kun-Sheng [1]
Chen, Ya-Shu [1]
Affiliations
[1] Natl Taiwan Univ Sci & Technol, Dept Elect Engn, Taipei 10607, Taiwan
Keywords
Graphics processing units; Neural networks; Memory management; Resource management; Kernel; Registers; Random access memory; Memory oversubscription; Memory thrashing
DOI
10.1109/TPDS.2022.3199806
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Classification Code
081202
Abstract
Modern intelligent devices usually execute multiple neural networks to improve service quality. However, system performance degrades significantly when the working set exceeds the physical memory capacity, a phenomenon called memory oversubscription. To support the execution of multiple independent neural networks under limited physical memory, this article explores resource management on GPUs with unified virtual memory and demand paging. We first analyze how the assignment of streaming multiprocessors (SMs) to simultaneously executing neural networks relates to the page fault overhead incurred by memory thrashing. To boost performance by reducing the page fault penalty, we propose a batch-aware resource management approach, BARM, comprising (1) batch-aware SM resource allocation to increase the batch size and (2) thrashing-preventing memory allocation to eliminate run-time thrashing. The proposed method was evaluated on a series of workloads and reduces response latency significantly compared with a state-of-the-art page fault prefetcher and batch-aware TLP management. The framework was also implemented on a real platform and evaluated through a case study, with impressive results.
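For context, the sketch below illustrates the stock CUDA unified-memory and demand-paging setting the abstract describes: a managed allocation deliberately larger than device memory, so GPU accesses fault pages in on demand, and evictions under pressure produce the thrashing that BARM is designed to prevent. This is a minimal illustration of the baseline mechanism under stated assumptions, not the authors' BARM policy; it assumes a Pascal-or-later GPU on Linux (where cudaMallocManaged allocations may exceed device memory) and uses only standard CUDA runtime calls.

// Minimal sketch (not the authors' BARM): oversubscribing GPU memory with
// CUDA unified memory, so kernel accesses are served by demand paging.
// Assumes a Pascal-or-later GPU on Linux; device 0 is used throughout.
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel; the first touch of a non-resident page raises a GPU
// page fault that the driver services by migrating the page to the device.
__global__ void touch(float *data, size_t n) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        data[i] += 1.0f;
}

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    // Oversubscribe: request ~1.5x the device's total memory. With
    // cudaMallocManaged this succeeds; pages migrate on first touch.
    size_t n = (size_t)(totalB * 1.5) / sizeof(float);
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "managed allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < n; ++i) data[i] = 0.0f;  // first touch on the host

    // Optional hint: prefetch part of the working set to the GPU up front,
    // reducing initial fault traffic (what a page fault prefetcher automates).
    cudaMemPrefetchAsync(data, freeB / 2, 0 /*device*/, 0 /*stream*/);

    // The buffer exceeds device memory, so faulting in the excess pages
    // forces evictions; under contention this becomes memory thrashing,
    // visible as long stalls at the synchronize below.
    touch<<<256, 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}

In this setting, kernels launched concurrently by multiple networks would compete for the same resident pages; BARM's thrashing-preventing memory allocation targets exactly that contention, while its batch-aware SM allocation controls how many thread blocks of each network touch pages at once.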
Pages: 4612-4624
Page count: 13