GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Cited by: 1
Authors
Guo, Cong [1 ]
Zhang, Rui [2 ]
Xu, Jiale [1 ]
Leng, Jingwen [1 ]
Liu, Zihan [1 ]
Huang, Ziyu [1 ]
Guo, Minyi [1 ]
Wu, Hao [2 ]
Zhao, Shouren [2 ]
Zhao, Junping [2 ]
Zhang, Ke [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Qi Zhi Inst, Shanghai, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
Memory Defragmentation; GPU; Deep Learning; Virtual Memory Stitching;
DOI
10.1145/3620665.3640423
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, and the memory capacity of a single acceleration device such as a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., 10x) of the GPU's native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly for popular memory-reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that these techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation for the splitting-based caching allocator. To mitigate this fragmentation problem, we propose GPU memory lake (GMLake), a novel memory allocation framework built on low-level GPU virtual memory management. GMLake employs a novel virtual memory stitching (VMS) mechanism that fuses non-contiguous memory blocks through virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by an average of 15% (up to 33%) across eight LLM models on an A100 GPU with 80 GB of memory. GMLake is completely transparent to DNN models and memory-reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machinelearning/glake/tree/main/GMLake.
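The core idea behind virtual memory stitching can be illustrated with CUDA's low-level virtual memory management driver API (the "low-level GPU virtual memory management" the abstract refers to). The sketch below is illustrative only and is not GMLake's actual implementation: it reserves one contiguous virtual address range and maps two separately created physical blocks into it, so the caller sees a single contiguous buffer even though the backing physical memory is non-contiguous. The block size, device index, and error handling are assumptions made for the example.

// Minimal sketch (not GMLake's code): stitch two non-contiguous physical GPU
// memory blocks into one contiguous virtual address range with the CUDA
// driver's virtual memory management API.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                   \
  do {                                                                \
    CUresult rc = (call);                                             \
    if (rc != CUDA_SUCCESS) {                                         \
      const char* msg = nullptr;                                      \
      cuGetErrorString(rc, &msg);                                     \
      fprintf(stderr, "%s failed: %s\n", #call, msg ? msg : "?");     \
      exit(1);                                                        \
    }                                                                 \
  } while (0)

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx;
  CHECK(cuCtxCreate(&ctx, 0, dev));

  // Physical allocations must be a multiple of the allocation granularity.
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;
  size_t gran = 0;
  CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM));

  // Two separate physical blocks; they need not be adjacent in physical memory.
  size_t blockSize = gran;  // illustrative size: one granule each
  CUmemGenericAllocationHandle h0, h1;
  CHECK(cuMemCreate(&h0, blockSize, &prop, 0));
  CHECK(cuMemCreate(&h1, blockSize, &prop, 0));

  // Reserve one contiguous virtual address range large enough for both blocks,
  // then map each physical block at a different offset of that range (the
  // "stitching" step).
  CUdeviceptr vaddr;
  CHECK(cuMemAddressReserve(&vaddr, 2 * blockSize, 0, 0, 0));
  CHECK(cuMemMap(vaddr, blockSize, 0, h0, 0));
  CHECK(cuMemMap(vaddr + blockSize, blockSize, 0, h1, 0));

  // Enable device read/write access; the stitched range can now be handed out
  // as if it were one contiguous buffer.
  CUmemAccessDesc access = {};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = dev;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(vaddr, 2 * blockSize, &access, 1));

  printf("Stitched 2 x %zu bytes at virtual address 0x%llx\n", blockSize,
         (unsigned long long)vaddr);

  // Teardown: unmap, release the physical handles, free the virtual range.
  CHECK(cuMemUnmap(vaddr, 2 * blockSize));
  CHECK(cuMemRelease(h0));
  CHECK(cuMemRelease(h1));
  CHECK(cuMemAddressFree(vaddr, 2 * blockSize));
  CHECK(cuCtxDestroy(ctx));
  return 0;
}

The sketch links against the CUDA driver library (e.g., nvcc stitch.cu -lcuda). Note that cuMemCreate requires sizes that are multiples of the allocation granularity, which is why the example allocates whole granules; an allocator such as GMLake manages many such granules and maps them behind tensor allocations transparently.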
Pages: 450-466
Page count: 17
Related Papers
50 records in total
  • [41] A Memory-Efficient and Modular Approach for Large-Scale String Pattern Matching. Le, Hoang; Prasanna, Viktor K. IEEE TRANSACTIONS ON COMPUTERS, 2013, 62(05): 844-857.
  • [42] Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models. Wei, Yating; Wang, Zhiyong; Wang, Zhongwei; Dai, Yong; Ou, Gongchang; Gao, Han; Yang, Haitao; Wang, Yue; Cao, Caleb Chen; Weng, Luoxuan; Lu, Jiaying; Zhu, Rongchen; Chen, Wei. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30(07): 3915-3929.
  • [43] On Efficient Wear Leveling for Large-Scale Flash-Memory Storage Systems. Chang, Li-Pin. APPLIED COMPUTING 2007, VOL 1 AND 2, 2007: 1126-1130.
  • [44] Time and Memory Efficient Large-Scale Canonical Correlation Analysis in Fourier Domain. Shen, Xiang-Jun; Xu, Zhaorui; Wang, Liangjun; Li, Zechao. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 5710-5718.
  • [45] DistSim: A performance model of large-scale hybrid distributed DNN training. Lu, Guandong; Chen, Runzhe; Wang, Yakai; Zhou, Yangjie; Zhang, Rui; Hu, Zheng; Miao, Yanming; Cai, Zhifang; Li, Li; Leng, Jingwen; Guo, Minyi. PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2023, CF 2023, 2023: 112-122.
  • [46] GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training. Sun, Peng; Wen, Yonggang; Han, Ruobing; Feng, Wansen; Yan, Shengen. IEEE TRANSACTIONS ON BIG DATA, 2022, 8(02): 495-507.
  • [48] Large-Scale Sorting in Uniform Memory Hierarchies. Vitter, J. S.; Nodine, M. H. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1993, 17(1-2): 107-114.
  • [49] Optimizing memory transactions for large-scale programs. Carvalho, Fernando Miguel; Cachopo, Joao. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2016, 89: 13-24.
  • [50] Efficient Migration of Large-Memory VMs Using Private Virtual Memory. Muraoka, Yuji; Kourai, Kenichi. ADVANCES IN INTELLIGENT NETWORKING AND COLLABORATIVE SYSTEMS, INCOS - 2019, 2020, 1035: 380-389.