Deep learning based data prefetching in CPU-GPU unified virtual memory

Cited by: 9
Authors
Long, Xinjian [1 ,2 ]
Gong, Xiangyang [1 ,2 ]
Zhang, Bo [1 ,2 ]
Zhou, Huiyang [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Comp Sci, Natl Pilot Software Engn Sch, Beijing 100876, Peoples R China
[3] North Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
National Natural Science Foundation of China;
Keywords
Data prefetching; Graphics processing unit; Unified virtual memory; Deep learning; Transformer;
DOI
10.1016/j.jpdc.2022.12.004
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Unified Virtual Memory (UVM) relieves developers of the onus of maintaining complex data structures and explicit data migration by enabling on-demand data movement between CPU memory and GPU memory. However, on-demand paging soon becomes a performance bottleneck of UVM due to the high latency caused by page table walks and data migration over the interconnect. Prefetching is considered a promising solution to this problem given its ability to leverage the locality of program memory access patterns. However, existing locality-based prefetching schemes cannot handle all situations. An ideal prefetcher should not only look at narrow regions of the requested address space but also capture global context to deliver a good prediction of the memory access pattern. This paper proposes a novel framework for page prefetching in UVM through deep learning. We first show that a powerful Transformer learning model can provide high accuracy for UVM page prefetching. We then analyze and interpret this Transformer model and derive several insights that allow us to design a simpler model that matches the unconstrained model's accuracy at orders of magnitude lower cost. We use a pattern-based method to make the UVM page predictor generalize across different GPU workloads. We evaluate this framework on a set of 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art (SOTA) UVM framework, improving performance by 10.89%, improving the device memory page hit rate by 16.98% (89.02% vs. 76.10% for the prior art), and reducing CPU-GPU interconnect traffic by 11.05%. According to our proposed unified metric, which combines accuracy, coverage, and page hit rate, our solution comes closer to the ideal prefetching scheme than the SOTA design (0.90 vs. 0.85, where a perfect prefetcher scores 1.0). © 2022 Elsevier Inc. All rights reserved.
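The pattern-based page prediction the abstract describes can be illustrated, in greatly simplified form, by a history-table prefetcher over page-fault deltas: the stream of faulting page numbers is converted to successive deltas, and a short delta history indexes the most frequently observed next delta. This is a minimal stdlib sketch of the general idea only, not the paper's Transformer-derived model; the names (`PatternPrefetcher`, `history_len`) and the frequency-table mechanism are assumptions for illustration.

```python
from collections import Counter, defaultdict

def deltas(page_faults):
    """Convert a stream of faulting page numbers into successive deltas."""
    return [b - a for a, b in zip(page_faults, page_faults[1:])]

class PatternPrefetcher:
    """Toy pattern-table predictor: maps a short history of page-number
    deltas to the most frequently observed next delta. A hypothetical
    stand-in for a learned predictor, for illustration only."""

    def __init__(self, history_len=2):
        self.history_len = history_len
        self.table = defaultdict(Counter)  # history tuple -> next-delta counts

    def train(self, delta_stream):
        h = self.history_len
        for i in range(len(delta_stream) - h):
            key = tuple(delta_stream[i:i + h])
            self.table[key][delta_stream[i + h]] += 1

    def predict(self, recent_deltas):
        key = tuple(recent_deltas[-self.history_len:])
        if key not in self.table:
            return None  # no match: fall back to on-demand paging
        return self.table[key].most_common(1)[0][0]

# Example: a strided fault pattern alternating +2 and +1 page steps.
faults = [0, 2, 3, 5, 6, 8, 9, 11, 12]
d = deltas(faults)            # [2, 1, 2, 1, 2, 1, 2, 1]
p = PatternPrefetcher(history_len=2)
p.train(d)
print(p.predict([2, 1]))      # -> 2 (prefetch two pages ahead)
```

A real UVM prefetcher would additionally bound prediction latency and issue the migrations ahead of the fault; the sketch only shows how a delta-history pattern can drive the prediction.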
Pages: 19-31 (13 pages)