Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Cited by: 0
Authors
Tsai, Yao-Hung Hubert [1 ]
Bai, Shaojie [1 ]
Yamada, Makoto [3 ,4 ]
Morency, Louis-Philippe [2 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Kyoto Univ, Kyoto, Japan
[4] RIKEN AIP, Wako, Saitama, Japan
Funding
U.S. National Institutes of Health
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the stream. In this paper, we present a new formulation of attention via the lens of the kernel. More precisely, we show that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as how to better integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model, with less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
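The kernel-smoother reading of attention described in the abstract can be illustrated with a short, self-contained sketch. This is not the authors' implementation: the names kernel_smoother_attention and exp_dot_kernel are illustrative assumptions, and the exponentiated scaled dot product is used only to show that this kernel choice recovers standard softmax attention as one instance of the smoother.

# Minimal sketch (assumed names, not the paper's code): attention as a kernel smoother,
# where each output is a kernel-weighted average of the values, with the weights
# normalized over the set of keys.
import numpy as np

def kernel_smoother_attention(queries, keys, values, kernel_fn):
    # Kernel scores: similarity of every query to every key.
    scores = np.array([[kernel_fn(q, k) for k in keys] for q in queries])
    # Normalize per query so the weights sum to one (a Nadaraya-Watson smoother).
    weights = scores / scores.sum(axis=-1, keepdims=True)
    # Smoothed output: weighted average of the value vectors.
    return weights @ values

def exp_dot_kernel(q, k):
    # Non-negative similarity; with this choice the smoother reduces to
    # standard scaled dot-product (softmax) attention.
    return np.exp(q @ k / np.sqrt(len(q)))

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries of dimension 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out = kernel_smoother_attention(Q, K, V, exp_dot_kernel)
print(out.shape)  # (3, 8): one smoothed value vector per query

Swapping exp_dot_kernel for another non-negative similarity (for instance a product of symmetric kernels, as in the variant mentioned in the abstract) changes only kernel_fn; the smoothing step itself stays the same.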
Pages: 4344 - 4353
Page count: 10
Related Papers
50 in total (items 41-50 shown)
  • [41] PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation
    Chenxing Xia
    Xiuzhen Duan
    Xiuju Gao
    Bin Ge
    Kuan-Ching Li
    Xianjin Fang
    Yan Zhang
    Ke Yang
    Neural Processing Letters, 56
  • [42] Scale-Insensitive Object Detection via Attention Feature Pyramid Transformer Network
    Li, Lingling
    Zheng, Changwen
    Mao, Cunli
    Deng, Haibo
    Jin, Taisong
    NEURAL PROCESSING LETTERS, 2022, 54 (01) : 581 - 595
  • [43] EFFECTIVE IMAGE TAMPERING LOCALIZATION VIA ENHANCED TRANSFORMER AND CO-ATTENTION FUSION
    Guo, Kun
    Zhu, Haochen
    Cao, Gang
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 4895 - 4899
  • [44] Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
    Yu, Jianfei
    Jiang, Jing
    Yang, Li
    Xia, Rui
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3342 - 3352
  • [45] Speech recognition based on the transformer's multi-head attention in Arabic
    Mahmoudi O.
    Filali-Bouami M.
    Benchat M.
    International Journal of Speech Technology, 2024, 27 (01) : 211 - 223
  • [46] A coordinate attention enhanced swin transformer for handwriting recognition of Parkinson's disease
    Wang, Nana
    Niu, Xuesen
    Yuan, Yiyang
    Sun, Yunze
    Li, Ran
    You, Guoliang
    Zhao, Aite
    IET IMAGE PROCESSING, 2023, 17 (09) : 2686 - 2697
  • [47] Exploring high-quality image deraining Transformer via effective large kernel attention
    Haobo Dong
    Tianyu Song
    Xuanyu Qi
    Jiyu Jin
    Guiyue Jin
    Lei Fan
    The Visual Computer, 2025, 41 (4) : 2545 - 2561
  • [48] ENHANCED TRANSFORMER-BASED DEEP KERNEL FUSED SELF ATTENTION MODEL FOR LUNG NODULE SEGMENTATION AND CLASSIFICATION
    Saritha, R. Rani
    Gunasundari, R.
    ARCHIVES FOR TECHNICAL SCIENCES, 2024, (31): 175 - 191
  • [49] Self-Supervised Point Cloud Understanding via Mask Transformer and Contrastive Learning
    Wang, Di
    Yang, Zhi-Xin
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (01) : 184 - 191
  • [50] Underwater Image Enhancement via Adaptive Group Attention-Based Multiscale Cascade Transformer
    Huang, Zhixiong
    Li, Jinjiang
    Hua, Zhen
    Fan, Linwei
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71