GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Cited by: 1
Authors
Guo, Cong [1 ]
Zhang, Rui [2 ]
Xu, Jiale [1 ]
Leng, Jingwen [1 ]
Liu, Zihan [1 ]
Huang, Ziyu [1 ]
Guo, Minyi [1 ]
Wu, Hao [2 ]
Zhao, Shouren [2 ]
Zhao, Junping [2 ]
Zhang, Ke [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Qi Zhi Inst, Shanghai, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Memory Defragmentation; GPU; Deep Learning; Virtual Memory Stitching;
DOI
10.1145/3620665.3640423
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, and the memory capacity of a single acceleration device such as a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., 10x) of the GPU's native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly under popular memory reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that these techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation in the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management, called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism that fuses non-contiguous memory blocks through virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by 15% (up to 33%) across eight LLM models on an A100 GPU with 80 GB of memory. GMLake is completely transparent to the DNN models and memory reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machinelearning/glake/tree/main/GMLake.
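To make the mechanism concrete, the sketch below shows how two non-contiguous physical blocks can be exposed as a single contiguous buffer using the CUDA driver's low-level virtual memory management API (cuMemCreate, cuMemAddressReserve, cuMemMap, cuMemSetAccess). It is a minimal illustration of the virtual-memory-stitching idea under that API, not GMLake's actual implementation; the helper name stitch_two_blocks and the block sizes are hypothetical.

// A minimal sketch (not GMLake's code): stitch two non-contiguous physical
// blocks into one contiguous virtual range via the CUDA driver API.
// Build (assumption): nvcc stitch_sketch.cu -lcuda -o stitch_sketch
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                        \
  do {                                                                     \
    CUresult _r = (call);                                                  \
    if (_r != CUDA_SUCCESS) {                                              \
      fprintf(stderr, "%s failed with CUresult %d\n", #call, (int)_r);     \
      exit(1);                                                             \
    }                                                                      \
  } while (0)

// Hypothetical helper: reserve one contiguous virtual address range and map
// two independently created physical allocations back-to-back into it.
static CUdeviceptr stitch_two_blocks(size_t block_bytes, int device, size_t *total) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;

  // Physical allocations must be multiples of the allocation granularity.
  size_t gran = 0;
  CHECK(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
  size_t sz = ((block_bytes + gran - 1) / gran) * gran;

  // Two physical blocks; they need not be contiguous in device memory.
  CUmemGenericAllocationHandle h0, h1;
  CHECK(cuMemCreate(&h0, sz, &prop, 0));
  CHECK(cuMemCreate(&h1, sz, &prop, 0));

  // One contiguous virtual address range large enough for both blocks.
  CUdeviceptr va = 0;
  CHECK(cuMemAddressReserve(&va, 2 * sz, 0, 0, 0));

  // Map the two physical blocks back-to-back ("stitching").
  CHECK(cuMemMap(va, sz, 0, h0, 0));
  CHECK(cuMemMap(va + sz, sz, 0, h1, 0));

  // Grant the device read/write access to the whole stitched range.
  CUmemAccessDesc access = {};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(va, 2 * sz, &access, 1));

  // The mappings keep the physical memory alive; the handles can be released.
  CHECK(cuMemRelease(h0));
  CHECK(cuMemRelease(h1));

  *total = 2 * sz;
  return va;  // usable as a single contiguous buffer of *total bytes
}

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CUcontext ctx;
  CHECK(cuDeviceGet(&dev, 0));
  CHECK(cuCtxCreate(&ctx, 0, dev));

  size_t total = 0;
  CUdeviceptr p = stitch_two_blocks(32 << 20, /*device=*/0, &total);  // two ~32 MB blocks
  CHECK(cuMemsetD8(p, 0, total));  // touch the whole stitched range
  printf("stitched %zu bytes at virtual address %p\n", total, (void *)p);

  CHECK(cuCtxDestroy(ctx));
  return 0;
}

Because the range returned by cuMemAddressReserve is contiguous in the virtual address space, a tensor can span the stitched region even though the two backing allocations are not adjacent in physical GPU memory, which is the property a splitting-based caching allocator loses once its pool becomes fragmented.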
Pages: 450 - 466
Page count: 17
Related Papers
50 records in total
  • [1] Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training
    Choi, Hyeonseong
    Lee, Jaehwan
    APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [2] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
    Lim, Gangmuk
    Ahn, Jeongseob
    Xiao, Wencong
    Kwon, Youngjin
    Jeon, Myeongjae
    PROCEEDINGS OF THE 2021 USENIX ANNUAL TECHNICAL CONFERENCE, 2021, : 523 - 536
  • [3] Training large-scale language models with limited GPU memory: a survey
    Tang, Yu
    Qiao, Linbo
    Yin, Lujia
    Liang, Peng
    Shen, Ao
    Yang, Zhilin
    Zhang, Lizhi
    Li, Dongsheng
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2025, 26 (03) : 309 - 331
  • [4] mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training
    Dreuning, Henk
    Bal, Henri E.
    van Nieuwpoort, Rob V.
    EURO-PAR 2022: PARALLEL PROCESSING, 2022, 13440 : 155 - 170
  • [5] EFFICIENT MEMORY ACCESS IN LARGE-SCALE COMPUTATION
    VITTER, JS
    LECTURE NOTES IN COMPUTER SCIENCE, 1991, 480 : 26 - 41
  • [6] Waterwave: A GPU Memory Flow Engine for Concurrent DNN Training
    Shi, Xuanhua
    Peng, Xuan
    He, Ligang
    Zhao, Yunfei
    Jin, Hai
    IEEE TRANSACTIONS ON COMPUTERS, 2023, 72 (10) : 2938 - 2950
  • [7] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
    Jeon, Myeongjae
    Venkataraman, Shivaram
    Phanishayee, Amar
    Qian, Junjie
    Xiao, Wencong
    Yang, Fan
    PROCEEDINGS OF THE 2019 USENIX ANNUAL TECHNICAL CONFERENCE, 2019, : 947 - 960
  • [8] LARGE-SCALE PARTICLE SIMULATIONS IN A VIRTUAL MEMORY COMPUTER
    GRAY, PC
    WAGNER, JS
    TAJIMA, T
    MILLION, R
    COMPUTER PHYSICS COMMUNICATIONS, 1983, 30 (02) : 109 - 120
  • [9] Occamy: Memory-efficient GPU Compiler for DNN Inference
    Lee, Jaeho
    Jeong, Shinnung
    Song, Seungbin
    Kim, Kunwoo
    Choi, Heelim
    Kim, Youngsok
    Kim, Hanjun
    2023 60TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC, 2023,