RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models

Cited: 0
Authors
Jeon, Yunhyeong [1 ]
Jang, Minwoo [1 ]
Lee, Hwanjun [1 ]
Jung, Yeji [1 ]
Jung, Jin [2 ]
Lee, Jonggeon [2 ]
So, Jinin [2 ]
Kim, Daehoon [3 ]
Affiliations
[1] DGIST, Daegu 42988, South Korea
[2] Samsung Electronics, Hwaseong 443743, South Korea
[3] Yonsei University, Seoul 03722, South Korea
Keywords
Graphics processing units; Transformers; Random access memory; Kernel; Computer architecture; Natural language processing; Computational modeling; Vectors; Inverters; Encoding; Processing-in-memory; transformer model; rotary positional embedding
DOI
10.1109/LCA.2025.3535470
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A key driver of these improvements is the use of positional embeddings, which capture the contextual relationships between tokens in a sequence. However, existing positional embedding methods face challenges, particularly in managing performance overhead for long sequences and in effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that embeds positional information accurately without requiring model retraining, even for long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference: we observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM achieves this with a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-add operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9x performance improvement and 914.1x energy savings compared to conventional systems.
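Although this record is bibliographic, the abstract turns on the structure of the RoPE computation, so a minimal NumPy sketch of standard RoPE (the interleaved even/odd pairing formulation from the RoFormer paper) may help for orientation. The function name, shapes, and pairing convention are illustrative assumptions; this models only the mathematical operation RoPIM accelerates, not the paper's bank-level datapath or data mapping.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each even/odd feature pair of x by a position-dependent angle.

    x: float array of shape (seq_len, dim), dim even; rows are per-token
    query or key vectors. Pair (2i, 2i+1) at position m is rotated by
    m * theta_i, where theta_i = base^(-2i/dim).
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"

    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    theta = base ** (-np.arange(0, dim, 2) / dim)    # (dim/2,)
    angles = positions * theta                       # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]           # rearrange into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # 2D rotation, 1st coord
    out[:, 1::2] = x_even * sin + x_odd * cos        # 2D rotation, 2nd coord
    return out

# Example: rotate 128 query vectors with head dimension 64.
q = np.random.randn(128, 64)
q_rot = rope(q)
```

The elementwise sin/cos multiply-adds and the even/odd rearrangement above correspond to the multiply-add operations and parallel data rearrangement that, per the abstract, RoPIM performs inside DRAM banks to avoid the data movement that makes RoPE costly on a GPU.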
Pages: 41-44
Page Count: 4