Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Cited by: 0
Authors
Tsai, Yao-Hung Hubert [1 ]
Bai, Shaojie [1 ]
Yamada, Makoto [3 ,4 ]
Morency, Louis-Philippe [2 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Kyoto Univ, Kyoto, Japan
[4] RIKEN AIP, Wako, Saitama, Japan
Funding: National Institutes of Health (USA)
Keywords: (none listed)
DOI: Not available
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the stream. In this paper, we present a new formulation of attention via the lens of the kernel. More precisely, we show that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This formulation gives us a better way to understand individual components of the Transformer's attention, such as how to better integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention that models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model while requiring less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
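The kernel-smoother view stated in the abstract can be checked directly: with an exponentiated scaled-dot-product kernel, kernel smoothing over the keys reproduces standard softmax attention exactly. Below is a minimal NumPy sketch of that equivalence; it is illustrative code, not the authors' released implementation, and the function names are our own:

    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard scaled dot-product attention.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def kernel_smoother_attention(Q, K, V, kernel):
        # Attention as a kernel smoother over the keys:
        # out(q) = sum_k kernel(q, x_k) * v_k / sum_k' kernel(q, x_k')
        G = np.array([[kernel(q, k) for k in K] for q in Q])  # query-key Gram matrix
        return (G / G.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))

    # The exponentiated scaled dot product recovers softmax attention exactly.
    exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(len(q)))
    assert np.allclose(softmax_attention(Q, K, V),
                       kernel_smoother_attention(Q, K, V, exp_kernel))

Swapping exp_kernel for another positive-valued kernel (e.g., an RBF kernel, or a product of symmetric kernels as the abstract proposes) yields alternative attention variants within the same smoothing framework.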
Pages: 4344-4353 (10 pages)
Related Papers (showing 10 of 50)
  • [1] Dong, Haobo; Song, Tianyu; Qi, Xuanyu; Jin, Jiyu; Jin, Guiyue; Fan, Lei. Exploring high-quality image deraining Transformer via effective large kernel attention. VISUAL COMPUTER, 2025, 41(04): 2545-2561.
  • [2] Wen, Yang; Chen, Samuel; Shrestha, Abhishek Krishna. Fast Vision Transformer via Additive Attention. 2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024: 573-574.
  • [3] Liu, Xiaoxi; Liu, Ju; Gu, Lingchen. Empowering lightweight video transformer via the kernel learning. ELECTRONICS LETTERS, 2024, 60(09).
  • [4] Zheng, Yushan; Li, Jun; Shi, Jun; Xie, Fengying; Jiang, Zhiguo. Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT II, 2022, 13432: 283-292.
  • [5] Tomida, Yuto; Katayama, Takafumi; Song, Tian; Shimamoto, Takashi. Efficient Deraining model using Transformer and Kernel Basis Attention for UAVs. 2024 INTERNATIONAL TECHNICAL CONFERENCE ON CIRCUITS/SYSTEMS, COMPUTERS, AND COMMUNICATIONS, ITC-CSCC 2024, 2024.
  • [6] Niu, Runliang; Wei, Zhepei; Wang, Yan; Wang, Qi. ATTEXPLAINER: Explain Transformer via Attention by Reinforcement Learning. PROCEEDINGS OF THE THIRTY-FIRST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2022, 2022: 724-731.
  • [7] Zhang, Biao; Xiong, Deyi; Su, Jinsong. Accelerating Neural Transformer via an Average Attention Network. PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018: 1789-1798.
  • [8] Bhattacharya, Nicholas; Thomas, Neil; Rao, Roshan; Dauparas, Justas; Koo, Peter K.; Baker, David; Song, Yun S.; Ovchinnikov, Sergey. Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention. BIOCOMPUTING 2022, PSB 2022, 2022: 34-45.
  • [9] Wei, Yiwei; Wu, Chunlei; Li, Guohe; Shi, Haitao. Sequential Transformer via an Outside-In Attention for image captioning. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 108.
  • [10] Deng, Haoyu; Fang, Yanmei; Huang, Fangjun. Patch Attacks on Vision Transformer via Skip Attention Gradients. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VIII, 2025, 15038: 554-567.