Neighborhood Attention Transformer

Cited by: 115
Authors
Hassani, Ali [1 ,2 ]
Walton, Steven [1 ,2 ]
Li, Jiachen [1 ,2 ]
Li, Shen [4 ]
Shi, Humphrey [1 ,2 ,3 ]
Affiliations
[1] Univ Oregon, SHI Labs, Eugene, OR 97403 USA
[2] UIUC, Champaign, IL 61801 USA
[3] Picsart AI Res PAIR, New York, NY USA
[4] Meta Facebook AI, Menlo Pk, CA USA
DOI
10.1109/CVPR52729.2023.00599
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation that localizes self-attention (SA) to each pixel's nearest neighbors, and therefore enjoys linear time and space complexity, in contrast to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without extra pixel shifts, and it preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present the Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results with NAT are competitive: NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K, improvements of 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU over a Swin model of similar size. To support further research on sliding-window attention, we open-source our project and release our checkpoints.
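For intuition, below is a minimal, naive sketch in PyTorch of the neighborhood attention computation the abstract describes. It is not the paper's NATTEN implementation (NATTEN uses fused C++/CUDA kernels); the single-head (B, H, W, C) tensor layout, the kernel_size parameter, and the helper name naive_neighborhood_attention are illustrative assumptions.

import torch
import torch.nn.functional as F

def naive_neighborhood_attention(q, k, v, kernel_size=7):
    """q, k, v: (B, H, W, C) single-head feature maps; returns (B, H, W, C).

    Each query pixel attends only to a kernel_size x kernel_size window of
    key/value pixels centered on it (shifted inward at borders), so every
    query sees exactly kernel_size**2 neighbors.
    """
    B, H, W, C = q.shape
    r = kernel_size // 2
    out = torch.empty_like(q)
    for i in range(H):
        # Clamp the window so border queries still get a full neighborhood.
        i0 = min(max(i - r, 0), H - kernel_size)
        for j in range(W):
            j0 = min(max(j - r, 0), W - kernel_size)
            kw = k[:, i0:i0 + kernel_size, j0:j0 + kernel_size].reshape(B, -1, C)
            vw = v[:, i0:i0 + kernel_size, j0:j0 + kernel_size].reshape(B, -1, C)
            # Scaled dot-product attention restricted to the local window.
            attn = (q[:, i, j, None] * kw).sum(-1) * C ** -0.5  # (B, k*k)
            out[:, i, j] = (F.softmax(attn, dim=-1)[..., None] * vw).sum(1)
    return out

x = torch.randn(1, 14, 14, 32)  # tiny smoke test on random features
y = naive_neighborhood_attention(x, x, x, kernel_size=7)
print(y.shape)  # torch.Size([1, 14, 14, 32])

With kernel_size fixed, this costs O(H * W * kernel_size^2) time and memory, i.e. linear in the number of pixels, which is the complexity advantage the abstract claims over the O((H * W)^2) cost of global self-attention.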
Pages: 6185-6194
Number of pages: 10
Related papers
50 in total
  • [41] CONMW Transformer: A General Vision Transformer Backbone with Merged-Window Attention
    Li, Ang
    Jiao, Jichao
    Li, Ning
    Qi, Wangjing
    Xu, Wei
    Pang, Min
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022: 1551-1555
  • [42] Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
    Pan, Xuran
    Ye, Tianzhu
    Xia, Zhuofan
    Song, Shiji
    Huang, Gao
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 2082-2091
  • [43] Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention
    Wu, Sitong
    Wu, Tianyi
    Tan, Haoru
    Guo, Guodong
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 2731-2739
  • [44] Cross-Attention Transformer for Video Interpolation
    Kim, Hannah Halin
    Yu, Shuzhi
    Yuan, Shuai
    Tomasi, Carlo
    COMPUTER VISION - ACCV 2022 WORKSHOPS, 2023, 13848: 325-342
  • [45] Bayesian Transformer Using Disentangled Mask Attention
    Chien, Jen-Tzung
    Huang, Yu-Han
    INTERSPEECH 2022, 2022: 1761-1765
  • [46] Temporal attention augmented transformer Hawkes process
    Zhang, Lu-ning
    Liu, Jian-wei
    Song, Zhi-yan
    Zuo, Xin
    NEURAL COMPUTING & APPLICATIONS, 2022, 34(5): 3795-3809
  • [47] Transformer Uncertainty Estimation with Hierarchical Stochastic Attention
    Pei, Jiahuan
    Wang, Cheng
    Szarvas, Gyorgy
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 11147-11155
  • [48] A Dual-Attention Transformer Network for Pansharpening
    Wu, Kun
    Yang, Xiaomin
    Nie, Zihao
    Li, Haoran
    Jeon, Gwanggil
    IEEE SENSORS JOURNAL, 2024, 24(5): 5500-5511
  • [49] MTAtrack: Multilevel transformer attention for visual tracking
    An, Dong
    Zhang, Fan
    Zhao, Yuqian
    Luo, Biao
    Yang, Chunhua
    Chen, Baifan
    Yu, Lingli
    OPTICS AND LASER TECHNOLOGY, 2023, 166
  • [50] Cross Attention with Monotonic Alignment for Speech Transformer
    Zhao, Yingzhu
    Ni, Chongjia
    Leung, Cheung-Chi
    Joty, Shafiq
    Chng, Eng Siong
    Ma, Bin
    INTERSPEECH 2020, 2020: 5031-5035