Deep learning based data prefetching in CPU-GPU unified virtual memory

Cited by: 9
Authors
Long, Xinjian [1 ,2 ]
Gong, Xiangyang [1 ,2 ]
Zhang, Bo [1 ,2 ]
Zhou, Huiyang [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Comp Sci, Natl Pilot Software Engn Sch, Beijing 100876, Peoples R China
[3] North Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
National Natural Science Foundation of China;
Keywords
Data prefetching; Graphics processing unit; Unified virtual memory; Deep learning; Transformer;
DOI
10.1016/j.jpdc.2022.12.004
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Unified Virtual Memory (UVM) relieves developers of the onus of maintaining complex data structures and explicit data migration by enabling on-demand data movement between CPU memory and GPU memory. However, on-demand paging soon becomes a performance bottleneck of UVM due to the high latency caused by page table walks and data migration over the interconnect. Prefetching is considered a promising solution to this problem given its ability to leverage the locality of program memory access patterns. However, existing locality-based prefetching schemes cannot handle all situations. An ideal prefetcher should not only look at narrow regions of the requested address space but also capture global context to deliver a good prediction of the memory access pattern. This paper proposes a novel framework for page prefetching in UVM through deep learning. We first show that a powerful Transformer learning model can provide high accuracy for UVM page prefetching. We then analyze and interpret this Transformer model and derive several insights that allow us to design a simpler model that matches the unconstrained model's accuracy at orders of magnitude lower cost. We use a pattern-based method to make the UVM page predictor generalize across different GPU workloads. We evaluate this framework on a set of 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art (SOTA) UVM framework, improving performance by 10.89%, improving the device memory page hit rate by 16.98% (89.02% vs. 76.10% for the prior art), and reducing CPU-GPU interconnect traffic by 11.05%. According to our proposed unified metric, which combines accuracy, coverage, and page hit rate, our solution comes closer to the ideal prefetching scheme than the SOTA design (0.90 vs. 0.85, where a perfect prefetcher scores 1.0). © 2022 Elsevier Inc. All rights reserved.
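The pattern-based page prediction the abstract describes can be illustrated, in greatly simplified form, by a history-table prefetcher over page-fault deltas: the stream of faulting page numbers is converted to successive deltas, and a short delta history indexes the most frequently observed next delta. This is a minimal stdlib sketch of the general idea only, not the paper's Transformer-derived model; the names (`PatternPrefetcher`, `history_len`) and the frequency-table mechanism are assumptions for illustration.

```python
from collections import Counter, defaultdict

def deltas(page_faults):
    """Convert a stream of faulting page numbers into successive deltas."""
    return [b - a for a, b in zip(page_faults, page_faults[1:])]

class PatternPrefetcher:
    """Toy pattern-table predictor: maps a short history of page-number
    deltas to the most frequently observed next delta. A hypothetical
    stand-in for a learned predictor, for illustration only."""

    def __init__(self, history_len=2):
        self.history_len = history_len
        self.table = defaultdict(Counter)  # history tuple -> next-delta counts

    def train(self, delta_stream):
        h = self.history_len
        for i in range(len(delta_stream) - h):
            key = tuple(delta_stream[i:i + h])
            self.table[key][delta_stream[i + h]] += 1

    def predict(self, recent_deltas):
        key = tuple(recent_deltas[-self.history_len:])
        if key not in self.table:
            return None  # no match: fall back to on-demand paging
        return self.table[key].most_common(1)[0][0]

# Example: a strided fault pattern alternating +2 and +1 page steps.
faults = [0, 2, 3, 5, 6, 8, 9, 11, 12]
d = deltas(faults)            # [2, 1, 2, 1, 2, 1, 2, 1]
p = PatternPrefetcher(history_len=2)
p.train(d)
print(p.predict([2, 1]))      # -> 2 (prefetch two pages ahead)
```

A real UVM prefetcher would additionally bound prediction latency and issue the migrations ahead of the fault; the sketch only shows how a delta-history pattern can drive the prediction.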
Pages: 19-31 (13 pages)