Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Cited by: 9

Authors:

Cavalcante, Matheus [1]

Wüthrich, Domenic [1]

Perotti, Matteo [1]

Riedel, Samuel [1]

Benini, Luca [1,2]

Affiliations:

[1] Swiss Federal Institute of Technology (ETH Zurich), Integrated Systems Laboratory, Zurich, Switzerland

[2] University of Bologna, Bologna, Italy

Source:

2022 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2022

Keywords:

Vector Processing; SIMD; Many-Core; RISC-V Vector Extension

DOI:

10.1145/3508352.3549367

Chinese Library Classification number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.

Pages: 9
