Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Cited by: 0
Authors
Tsai, Yao-Hung Hubert [1 ]
Bai, Shaojie [1 ]
Yamada, Makoto [3 ,4 ]
Morency, Louis-Philippe [2 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Kyoto Univ, Kyoto, Japan
[4] RIKEN AIP, Wako, Saitama, Japan
Funding
U.S. National Institutes of Health
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the stream. In this paper, we present a new formulation of attention via the lens of the kernel. More precisely, we show that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as how to better integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model, with less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
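The kernel-smoother reading of attention described in the abstract can be illustrated with a short, self-contained sketch. This is not the authors' implementation: the names kernel_smoother_attention and exp_dot_kernel are illustrative assumptions, and the exponentiated scaled dot product is used only to show that this kernel choice recovers standard softmax attention as one instance of the smoother.

# Minimal sketch (assumed names, not the paper's code): attention as a kernel smoother,
# where each output is a kernel-weighted average of the values, with the weights
# normalized over the set of keys.
import numpy as np

def kernel_smoother_attention(queries, keys, values, kernel_fn):
    # Kernel scores: similarity of every query to every key.
    scores = np.array([[kernel_fn(q, k) for k in keys] for q in queries])
    # Normalize per query so the weights sum to one (a Nadaraya-Watson smoother).
    weights = scores / scores.sum(axis=-1, keepdims=True)
    # Smoothed output: weighted average of the value vectors.
    return weights @ values

def exp_dot_kernel(q, k):
    # Non-negative similarity; with this choice the smoother reduces to
    # standard scaled dot-product (softmax) attention.
    return np.exp(q @ k / np.sqrt(len(q)))

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries of dimension 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out = kernel_smoother_attention(Q, K, V, exp_dot_kernel)
print(out.shape)  # (3, 8): one smoothed value vector per query

Swapping exp_dot_kernel for another non-negative similarity (for instance a product of symmetric kernels, as in the variant mentioned in the abstract) changes only kernel_fn; the smoothing step itself stays the same.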
Pages: 4344 - 4353
Page count: 10
Related Papers
50 in total (items 41-50 shown)
  • [41] PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation
    Chenxing Xia
    Xiuzhen Duan
    Xiuju Gao
    Bin Ge
    Kuan-Ching Li
    Xianjin Fang
    Yan Zhang
    Ke Yang
    Neural Processing Letters, 56
  • [42] Scale-Insensitive Object Detection via Attention Feature Pyramid Transformer Network
    Li, Lingling
    Zheng, Changwen
    Mao, Cunli
    Deng, Haibo
    Jin, Taisong
    NEURAL PROCESSING LETTERS, 2022, 54 (01) : 581 - 595
  • [43] EFFECTIVE IMAGE TAMPERING LOCALIZATION VIA ENHANCED TRANSFORMER AND CO-ATTENTION FUSION
    Guo, Kun
    Zhu, Haochen
    Cao, Gang
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 4895 - 4899
  • [44] Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
    Yu, Jianfei
    Jiang, Jing
    Yang, Li
    Xia, Rui
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3342 - 3352
  • [45] Speech recognition based on the transformer's multi-head attention in Arabic
    Mahmoudi O.
    Filali-Bouami M.
    Benchat M.
    International Journal of Speech Technology, 2024, 27 (01) : 211 - 223
  • [46] A coordinate attention enhanced swin transformer for handwriting recognition of Parkinson's disease
    Wang, Nana
    Niu, Xuesen
    Yuan, Yiyang
    Sun, Yunze
    Li, Ran
    You, Guoliang
    Zhao, Aite
    IET IMAGE PROCESSING, 2023, 17 (09) : 2686 - 2697
  • [47] Exploring high-quality image deraining Transformer via effective large kernel attention
    Haobo Dong
    Tianyu Song
    Xuanyu Qi
    Jiyu Jin
    Guiyue Jin
    Lei Fan
    The Visual Computer, 2025, 41 (4) : 2545 - 2561
  • [48] ENHANCED TRANSFORMER-BASED DEEP KERNEL FUSED SELF ATTENTION MODEL FOR LUNG NODULE SEGMENTATION AND CLASSIFICATION
    Saritha, R. Rani
    Gunasundari, R.
    ARCHIVES FOR TECHNICAL SCIENCES, 2024, (31): 175 - 191
  • [49] Self-Supervised Point Cloud Understanding via Mask Transformer and Contrastive Learning
    Wang, Di
    Yang, Zhi-Xin
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (01) : 184 - 191
  • [50] Underwater Image Enhancement via Adaptive Group Attention-Based Multiscale Cascade Transformer
    Huang, Zhixiong
    Li, Jinjiang
    Hua, Zhen
    Fan, Linwei
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71