How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 9
Authors
Li, Yiran [1 ]
Wang, Junpeng [2 ]
Dai, Xin [2 ]
Wang, Liang [2 ]
Yeh, Chin-Chia Michael [2 ]
Zheng, Yan [2 ]
Zhang, Wei [2 ]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Visa Res, Palo Alto, CA 94301 USA
Keywords
Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics
DOI
10.1109/TVCG.2023.3261935
CLC number
TP31 [Computer software]
Subject classification
081202; 0835
Abstract
Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which one is more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
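The mechanism the abstract summarizes can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation; the class name, patch size, embedding dimension, and head count below are illustrative assumptions. It shows how an image is cut into patches, embedded as a token sequence, and passed through multi-head self-attention, and how the per-head attention matrices that head-level analyses (importance, strength, pattern) would inspect can be extracted.

# Minimal sketch (illustrative, not the authors' code): patch sequence + multi-head
# self-attention, returning one attention map per head.
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=192, num_heads=3):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided convolution that maps each
        # patch_size x patch_size patch to a `dim`-dimensional token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, images):
        b = images.shape[0]
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        tokens = tokens + self.pos_embed
        # average_attn_weights=False (PyTorch >= 1.11) keeps a separate
        # (N+1) x (N+1) attention matrix for every head instead of averaging them.
        out, attn = self.attn(tokens, tokens, tokens,
                              need_weights=True, average_attn_weights=False)
        return out, attn   # attn: (B, num_heads, N+1, N+1)

block = TinyViTBlock()
_, attn = block(torch.randn(2, 3, 224, 224))
print(attn.shape)   # torch.Size([2, 3, 197, 197])

From such per-head attention matrices one could, for example, compare the mass each patch assigns to its spatial neighbors versus distant patches, which is the kind of head-level attention-strength profiling the abstract describes.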
Pages: 2888-2900
Number of pages: 13