RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models

Cited by: 0
Authors
Jeon, Yunhyeong [1]
Jang, Minwoo [1]
Lee, Hwanjun [1]
Jung, Yeji [1]
Jung, Jin [2]
Lee, Jonggeon [2]
So, Jinin [2]
Kim, Daehoon [3]
Affiliations
[1] DGIST, Daegu 42988, South Korea
[2] Samsung Elect, Hwaseong 443743, South Korea
[3] Yonsei Univ, Seoul 03722, South Korea
Keywords
Graphics processing units; Transformers; Random access memory; Kernel; Computer architecture; Natural language processing; Computational modeling; Vectors; Inverters; Encoding; Processing-in-memory; transformer model; rotary positional embedding
DOI
10.1109/LCA.2025.3535470
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9x performance improvement and 914.1x energy savings compared to conventional systems.
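For readers unfamiliar with RoPE, the following is a minimal sketch of the standard rotary embedding applied to a single vector, written in plain Python. It is not the RoPIM hardware dataflow, which the abstract does not detail. It only illustrates the two ingredients the abstract names: a data rearrangement (swapping and negating neighbouring elements) followed by element-wise multiply-add. The function names `rope_rotate` and `dot`, the interleaved pairing of elements, and the default `base=10000.0` are assumptions chosen for illustration.

```python
import math


def rope_rotate(x, position, base=10000.0):
    """Rotate each (even, odd) pair of x by an angle that grows with position.

    Assumes the interleaved pairing (x[0], x[1]), (x[2], x[3]), ...
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("vector length must be even")

    # Rearrangement step: for each pair (a, b) build (-b, a).
    swapped = [0.0] * d
    for i in range(0, d, 2):
        swapped[i] = -x[i + 1]
        swapped[i + 1] = x[i]

    out = [0.0] * d
    for i in range(0, d, 2):
        # Pair index is i / 2, so the frequency is base^(-2 * (i / 2) / d).
        theta = base ** (-i / d)
        c = math.cos(position * theta)
        s = math.sin(position * theta)
        # Multiply-add step: out = x * cos + swapped * sin.
        out[i] = x[i] * c + swapped[i] * s
        out[i + 1] = x[i + 1] * c + swapped[i + 1] * s
    return out


def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

# The attention score depends only on the relative offset (here 2).
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because every pair is rotated independently, the two scores computed at the end agree up to floating-point error: shifting both positions by the same amount leaves the inner product unchanged, which is the property that lets RoPE encode relative position.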
Pages: 41-44
Number of pages: 4