GPU Performance Optimization via Intergroup Cache Cooperation

Cited by: 0

Authors:

Wang, Guosheng [1]

Du, Yajuan [1,2]

Huang, Weiming [1]
Affiliations:

[1] Wuhan University of Technology, School of Computer Science and Technology, Wuhan 430070, People's Republic of China

[2] Wuhan University of Technology, Shenzhen Research Institute, Shenzhen 518000, People's Republic of China

Source:

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS | 2024 / Vol. 43 / No. 11

Keywords:

Integrated circuits; Design automation; Graphics processing units; Computer architecture; Bidirectional control; Bandwidth; Benchmark testing; System-on-chip; Optimization; Cache; cooperation; GPU; hit ratio;

DOI:

10.1109/TCAD.2024.3443707

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Subject classification code:

0812

Abstract:

Modern GPUs integrate a multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the on-chip cache still falls far short of delivering optimal performance. In this article, we investigate the existing cache architecture and find that cache utilization is imbalanced and that there is serious data duplication among L1 cache groups. To exploit this duplicate data, we propose an intergroup cache cooperation (ICC) method that establishes cooperation across L1 cache groups. According to the cooperation scope, we design two schemes: adjacent cache cooperation (ICC-AGC) and multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to detect duplicate data and integrate a lightweight network for communication. In ICC-MGC, a bidirectional ring network is designed to connect multiple groups. We also present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency of request probing and sending. Evaluation results show that, compared with existing work, the two proposed ICC methods reduce the average traffic to the L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively.
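The ICC-MGC idea summarized in the abstract (L1 cache groups on a bidirectional ring, with a miss probed in peer groups before going to L2) can be illustrated with a small software model. This is a minimal, hypothetical sketch, not the authors' design: the class name `RingOfCacheGroups`, the use of Python sets as caches (no capacity, sets, or replacement), the reading of "two-way sending" as probing both ring directions at the same hop distance, and filling a local copy on every hit are all assumptions. The abstract does not specify the directory format or the dynamic sending policy, so neither is modeled.

```python
from dataclasses import dataclass


@dataclass
class RingOfCacheGroups:
    """Hypothetical model of L1 cache groups connected by a bidirectional ring.

    Each group is just a set of cached block addresses; capacity limits,
    associativity, and replacement are deliberately ignored (assumption).
    """
    groups: list[set[int]]
    l2_accesses: int = 0

    def lookup(self, requester: int, block: int) -> tuple[str, int]:
        """Return (where the block came from, ring hop distance probed)."""
        if block in self.groups[requester]:
            return ("local", 0)
        n = len(self.groups)
        # Assumed two-way probing: at each hop distance d, check the
        # clockwise neighbour first, then the counter-clockwise one.
        for d in range(1, n // 2 + 1):
            for peer in ((requester + d) % n, (requester - d) % n):
                if block in self.groups[peer]:
                    self.groups[requester].add(block)  # assumed local fill
                    return (f"group {peer}", d)
        # No peer holds the block: fall back to the shared L2 cache.
        self.l2_accesses += 1
        self.groups[requester].add(block)
        return ("L2", n // 2)


ring = RingOfCacheGroups(groups=[{0x10}, set(), set(), {0x40}, set(), {0x20}])
print(ring.lookup(1, 0x20))  # ('group 5', 2): found 2 hops counter-clockwise
print(ring.lookup(1, 0x20))  # ('local', 0): filled locally on the first hit
print(ring.lookup(2, 0x99))  # ('L2', 3): no group holds it, so L2 is accessed
print(ring.l2_accesses)      # 1
```

In this toy model every hit served by a peer group is one fewer L2 access, which is the intuition behind the reduced L2 traffic reported in the abstract; the real hardware trade-off between probe overhead and hit rate is what the paper's sending mechanisms address.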


Pages: 4142-4153

Number of pages: 12
