GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Cited by: 1
Authors
Guo, Cong [1 ]
Zhang, Rui [2 ]
Xu, Jiale [1 ]
Leng, Jingwen [1 ]
Liu, Zihan [1 ]
Huang, Ziyu [1 ]
Guo, Minyi [1 ]
Wu, Hao [2 ]
Zhao, Shouren [2 ]
Zhao, Junping [2 ]
Zhang, Ke [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Qi Zhi Inst, Shanghai, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Memory Defragmentation; GPU; Deep Learning; Virtual Memory Stitching;
DOI
10.1145/3620665.3640423
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, and the memory capacity of a single acceleration device such as a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., 10x) of the GPU's native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly under popular memory reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that these techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation in the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management, called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism that fuses non-contiguous memory blocks through virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by 15% (up to 33%) across eight LLM models on an A100 GPU with 80 GB of memory. GMLake is completely transparent to the DNN models and memory reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machinelearning/glake/tree/main/GMLake.
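To make the mechanism concrete, the sketch below shows how two non-contiguous physical blocks can be exposed as a single contiguous buffer using the CUDA driver's low-level virtual memory management API (cuMemCreate, cuMemAddressReserve, cuMemMap, cuMemSetAccess). It is a minimal illustration of the virtual-memory-stitching idea under that API, not GMLake's actual implementation; the helper name stitch_two_blocks and the block sizes are hypothetical.

// A minimal sketch (not GMLake's code): stitch two non-contiguous physical
// blocks into one contiguous virtual range via the CUDA driver API.
// Build (assumption): nvcc stitch_sketch.cu -lcuda -o stitch_sketch
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                        \
  do {                                                                     \
    CUresult _r = (call);                                                  \
    if (_r != CUDA_SUCCESS) {                                              \
      fprintf(stderr, "%s failed with CUresult %d\n", #call, (int)_r);     \
      exit(1);                                                             \
    }                                                                      \
  } while (0)

// Hypothetical helper: reserve one contiguous virtual address range and map
// two independently created physical allocations back-to-back into it.
static CUdeviceptr stitch_two_blocks(size_t block_bytes, int device, size_t *total) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;

  // Physical allocations must be multiples of the allocation granularity.
  size_t gran = 0;
  CHECK(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
  size_t sz = ((block_bytes + gran - 1) / gran) * gran;

  // Two physical blocks; they need not be contiguous in device memory.
  CUmemGenericAllocationHandle h0, h1;
  CHECK(cuMemCreate(&h0, sz, &prop, 0));
  CHECK(cuMemCreate(&h1, sz, &prop, 0));

  // One contiguous virtual address range large enough for both blocks.
  CUdeviceptr va = 0;
  CHECK(cuMemAddressReserve(&va, 2 * sz, 0, 0, 0));

  // Map the two physical blocks back-to-back ("stitching").
  CHECK(cuMemMap(va, sz, 0, h0, 0));
  CHECK(cuMemMap(va + sz, sz, 0, h1, 0));

  // Grant the device read/write access to the whole stitched range.
  CUmemAccessDesc access = {};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(va, 2 * sz, &access, 1));

  // The mappings keep the physical memory alive; the handles can be released.
  CHECK(cuMemRelease(h0));
  CHECK(cuMemRelease(h1));

  *total = 2 * sz;
  return va;  // usable as a single contiguous buffer of *total bytes
}

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CUcontext ctx;
  CHECK(cuDeviceGet(&dev, 0));
  CHECK(cuCtxCreate(&ctx, 0, dev));

  size_t total = 0;
  CUdeviceptr p = stitch_two_blocks(32 << 20, /*device=*/0, &total);  // two ~32 MB blocks
  CHECK(cuMemsetD8(p, 0, total));  // touch the whole stitched range
  printf("stitched %zu bytes at virtual address %p\n", total, (void *)p);

  CHECK(cuCtxDestroy(ctx));
  return 0;
}

Because the range returned by cuMemAddressReserve is contiguous in the virtual address space, a tensor can span the stitched region even though the two backing allocations are not adjacent in physical GPU memory, which is the property a splitting-based caching allocator loses once its pool becomes fragmented.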
Pages: 450 - 466
Page count: 17
Related Papers
50 records in total
  • [1] Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training
    Choi, Hyeonseong
    Lee, Jaehwan
    APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [2] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
    Lim, Gangmuk
    Ahn, Jeongseob
    Xiao, Wencong
    Kwon, Youngjin
    Jeon, Myeongjae
    PROCEEDINGS OF THE 2021 USENIX ANNUAL TECHNICAL CONFERENCE, 2021, : 523 - 536
  • [3] Training large-scale language models with limited GPU memory: a survey
    Tang, Yu
    Qiao, Linbo
    Yin, Lujia
    Liang, Peng
    Shen, Ao
    Yang, Zhilin
    Zhang, Lizhi
    Li, Dongsheng
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2025, 26 (03) : 309 - 331
  • [4] mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training
    Dreuning, Henk
    Bal, Henri E.
    van Nieuwpoort, Rob V.
    EURO-PAR 2022: PARALLEL PROCESSING, 2022, 13440 : 155 - 170
  • [5] EFFICIENT MEMORY ACCESS IN LARGE-SCALE COMPUTATION
    VITTER, JS
    LECTURE NOTES IN COMPUTER SCIENCE, 1991, 480 : 26 - 41
  • [6] Waterwave: A GPU Memory Flow Engine for Concurrent DNN Training
    Shi, Xuanhua
    Peng, Xuan
    He, Ligang
    Zhao, Yunfei
    Jin, Hai
    IEEE TRANSACTIONS ON COMPUTERS, 2023, 72 (10) : 2938 - 2950
  • [7] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
    Jeon, Myeongjae
    Venkataraman, Shivaram
    Phanishayee, Amar
    Qian, Junjie
    Xiao, Wencong
    Yang, Fan
    PROCEEDINGS OF THE 2019 USENIX ANNUAL TECHNICAL CONFERENCE, 2019, : 947 - 960
  • [8] LARGE-SCALE PARTICLE SIMULATIONS IN A VIRTUAL MEMORY COMPUTER
    GRAY, PC
    WAGNER, JS
    TAJIMA, T
    MILLION, R
    COMPUTER PHYSICS COMMUNICATIONS, 1983, 30 (02) : 109 - 120
  • [9] Occamy: Memory-efficient GPU Compiler for DNN Inference
    Lee, Jaeho
    Jeong, Shinnung
    Song, Seungbin
    Kim, Kunwoo
    Choi, Heelim
    Kim, Youngsok
    Kim, Hanjun
    2023 60TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC, 2023,