Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Cited by: 9
Authors
Cavalcante, Matheus [1]
Wüthrich, Domenic [1]
Perotti, Matteo [1]
Riedel, Samuel [1]
Benini, Luca [1, 2]
Affiliations
[1] Swiss Federal Institute of Technology (ETH Zurich), Integrated Systems Laboratory, Zurich, Switzerland
[2] University of Bologna, Bologna, Italy
Source
2022 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2022
Keywords
Vector Processing; SIMD; Many-Core; RISC-V Vector Extension
DOI
10.1145/3508352.3549367
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve Instruction-Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
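The comparisons quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not part of the paper: the constant and function names are hypothetical, and the "implied" Snitch baselines are derived here from the rounded percentages in the abstract, so they are approximations rather than reported measurements.

```python
# Figures quoted directly in the abstract.
SPATZ_ENERGY_PJ_PER_MAC = 7.9      # pJ per 32-bit integer MAC, four-MACU cluster
ENERGY_SAVING_VS_SNITCH = 0.40     # "40% less energy"
SPATZ_GOPS = 285.0                 # 256 x 256 int32 matmul on MemPool
THROUGHPUT_GAIN_VS_SNITCH = 0.70   # "70% more"
SPATZ_GOPS_PER_W = 266.0
SNITCH_GOPS_PER_W = 128.0


def baseline_from_saving(value: float, saving: float) -> float:
    """Baseline b such that value = b * (1 - saving)."""
    return value / (1.0 - saving)


def baseline_from_gain(value: float, gain: float) -> float:
    """Baseline b such that value = b * (1 + gain)."""
    return value / (1.0 + gain)


# Derived values (assumptions of this sketch, not figures from the paper).
implied_snitch_energy_pj = baseline_from_saving(
    SPATZ_ENERGY_PJ_PER_MAC, ENERGY_SAVING_VS_SNITCH
)
implied_snitch_gops = baseline_from_gain(SPATZ_GOPS, THROUGHPUT_GAIN_VS_SNITCH)
efficiency_ratio = SPATZ_GOPS_PER_W / SNITCH_GOPS_PER_W

if __name__ == "__main__":
    print(f"Implied Snitch energy per MAC: {implied_snitch_energy_pj:.1f} pJ")
    print(f"Implied Snitch throughput:     {implied_snitch_gops:.0f} GOPS")
    print(f"Energy-efficiency ratio:       {efficiency_ratio:.2f}x")
```

The efficiency ratio of 266 / 128 comes out slightly above 2, which is consistent with the abstract's "more than twice" claim.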
Pages: 9