Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Cited by: 0
Authors
Tsai, Yao-Hung Hubert [1 ]
Bai, Shaojie [1 ]
Yamada, Makoto [3 ,4 ]
Morency, Louis-Philippe [2 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Kyoto Univ, Kyoto, Japan
[4] RIKEN AIP, Wako, Saitama, Japan
Funding: National Institutes of Health (USA)
Keywords: (none listed)
DOI: Not available
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the stream. In this paper, we present a new formulation of attention via the lens of the kernel. More precisely, we show that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This formulation gives us a better way to understand individual components of the Transformer's attention, such as how to better integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention that models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model while requiring less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
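The kernel-smoother view stated in the abstract can be checked directly: with an exponentiated scaled-dot-product kernel, kernel smoothing over the keys reproduces standard softmax attention exactly. Below is a minimal NumPy sketch of that equivalence; it is illustrative code, not the authors' released implementation, and the function names are our own:

    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard scaled dot-product attention.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def kernel_smoother_attention(Q, K, V, kernel):
        # Attention as a kernel smoother over the keys:
        # out(q) = sum_k kernel(q, x_k) * v_k / sum_k' kernel(q, x_k')
        G = np.array([[kernel(q, k) for k in K] for q in Q])  # query-key Gram matrix
        return (G / G.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))

    # The exponentiated scaled dot product recovers softmax attention exactly.
    exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(len(q)))
    assert np.allclose(softmax_attention(Q, K, V),
                       kernel_smoother_attention(Q, K, V, exp_kernel))

Swapping exp_kernel for another positive-valued kernel (e.g., an RBF kernel, or a product of symmetric kernels as the abstract proposes) yields alternative attention variants within the same smoothing framework.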
Pages: 4344-4353 (10 pages)
Related Papers (showing 10 of 50)
  • [1] Dong, Haobo; Song, Tianyu; Qi, Xuanyu; Jin, Jiyu; Jin, Guiyue; Fan, Lei. Exploring high-quality image deraining Transformer via effective large kernel attention. VISUAL COMPUTER, 2025, 41(04): 2545-2561.
  • [2] Wen, Yang; Chen, Samuel; Shrestha, Abhishek Krishna. Fast Vision Transformer via Additive Attention. 2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024: 573-574.
  • [3] Liu, Xiaoxi; Liu, Ju; Gu, Lingchen. Empowering lightweight video transformer via the kernel learning. ELECTRONICS LETTERS, 2024, 60(09).
  • [4] Zheng, Yushan; Li, Jun; Shi, Jun; Xie, Fengying; Jiang, Zhiguo. Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT II, 2022, 13432: 283-292.
  • [5] Tomida, Yuto; Katayama, Takafumi; Song, Tian; Shimamoto, Takashi. Efficient Deraining model using Transformer and Kernel Basis Attention for UAVs. 2024 INTERNATIONAL TECHNICAL CONFERENCE ON CIRCUITS/SYSTEMS, COMPUTERS, AND COMMUNICATIONS, ITC-CSCC 2024, 2024.
  • [6] Niu, Runliang; Wei, Zhepei; Wang, Yan; Wang, Qi. ATTEXPLAINER: Explain Transformer via Attention by Reinforcement Learning. PROCEEDINGS OF THE THIRTY-FIRST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2022, 2022: 724-731.
  • [7] Zhang, Biao; Xiong, Deyi; Su, Jinsong. Accelerating Neural Transformer via an Average Attention Network. PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018: 1789-1798.
  • [8] Bhattacharya, Nicholas; Thomas, Neil; Rao, Roshan; Dauparas, Justas; Koo, Peter K.; Baker, David; Song, Yun S.; Ovchinnikov, Sergey. Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention. BIOCOMPUTING 2022, PSB 2022, 2022: 34-45.
  • [9] Wei, Yiwei; Wu, Chunlei; Li, Guohe; Shi, Haitao. Sequential Transformer via an Outside-In Attention for image captioning. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 108.
  • [10] Deng, Haoyu; Fang, Yanmei; Huang, Fangjun. Patch Attacks on Vision Transformer via Skip Attention Gradients. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VIII, 2025, 15038: 554-567.