ScopeViT: Scale-Aware Vision Transformer

Cited: 4
Authors
Nie, Xuesong [1 ]
Jin, Haoyuan [1 ]
Yan, Yunfeng [1 ]
Chen, Xi [2 ]
Zhu, Zhihang [1 ]
Qi, Donglian [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou 310027, Peoples R China
[2] Univ Hong Kong, Hong Kong 999077, Peoples R China
Keywords
Vision transformer; Multi-scale features; Efficient attention mechanism;
DOI
10.1016/j.patcog.2024.110470
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multi-scale features are essential for various vision tasks, such as classification, detection, and segmentation. Although Vision Transformers (ViTs) show remarkable success in capturing global features within an image, how to leverage multi-scale features in Transformers is not well explored. This paper proposes a scale-aware Vision Transformer called ScopeViT that efficiently captures multi-granularity representations. Two novel attention mechanisms with lightweight computation are introduced: Multi-Scale Self-Attention (MSSA) and Global-Scale Dilated Attention (GSDA). MSSA embeds visual tokens with different receptive fields into distinct attention heads, allowing the model to perceive various scales across the network. GSDA enhances the model's understanding of the global context through a token-dilation operation, which reduces the number of tokens involved in attention computation. This dual-attention method enables ScopeViT to "see" various scales throughout the entire network and effectively learn inter-object relationships, reducing the heavy quadratic computational complexity. Extensive experiments demonstrate that ScopeViT achieves competitive complexity/accuracy trade-offs compared to existing networks across a wide range of visual tasks. On the ImageNet-1K dataset, ScopeViT achieves a top-1 accuracy of 81.1% using only 7.4M parameters and 2.0G FLOPs. Our approach outperforms Swin (ViT-based) by 1.9% in accuracy while saving 42% of the parameters, outperforms MobileViTv2 (hybrid-based) with a 0.7% accuracy gain while using 50% of the computation, and also beats ConvNeXt V2 (ConvNet-based) by 0.8% with fewer parameters.
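The abstract describes MSSA and GSDA only at a high level. As a rough illustration, the PyTorch sketch below shows one plausible reading of the two ideas; it is not the authors' code, and the function names, the pooling-based multi-scale heads, the single-head attention, and the dilation rate r are all illustrative assumptions.

import torch
import torch.nn.functional as F

def multi_scale_attention(x, H, W, scales=(1, 2, 4)):
    # MSSA-style idea (assumed realization): split channels into one head per
    # scale; each head's keys/values come from the token map average-pooled by
    # that scale, so different heads perceive different receptive fields.
    B, N, C = x.shape
    assert N == H * W and C % len(scales) == 0
    d = C // len(scales)
    outs = []
    for i, s in enumerate(scales):
        q = x[..., i * d:(i + 1) * d]                       # (B, N, d) full-resolution queries
        kv = q.transpose(1, 2).reshape(B, d, H, W)
        if s > 1:
            kv = F.avg_pool2d(kv, kernel_size=s, stride=s)  # coarser keys/values
        kv = kv.flatten(2).transpose(1, 2)                  # (B, M, d), M = (H//s)*(W//s)
        attn = torch.softmax(q @ kv.transpose(-2, -1) / d ** 0.5, dim=-1)
        outs.append(attn @ kv)                              # (B, N, d)
    return torch.cat(outs, dim=-1)                          # (B, N, C)

def dilated_self_attention(x, H, W, r=2):
    # GSDA-style idea (assumed realization): split the H x W token grid into
    # r*r interleaved subsets and attend within each sparse subset; every
    # group still spans the whole image, so global context is preserved.
    B, N, C = x.shape
    assert N == H * W and H % r == 0 and W % r == 0
    x = x.view(B, H // r, r, W // r, r, C)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(B * r * r, N // (r * r), C)
    attn = torch.softmax(x @ x.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ x
    out = out.view(B, r, r, H // r, W // r, C).permute(0, 3, 1, 4, 2, 5)
    return out.reshape(B, N, C)                             # original token order restored

tokens = torch.randn(2, 14 * 14, 96)                        # e.g. a 14x14 grid of 96-d tokens
print(multi_scale_attention(tokens, 14, 14).shape)          # torch.Size([2, 196, 96])
print(dilated_self_attention(tokens, 14, 14).shape)         # torch.Size([2, 196, 96])

Under these assumptions the dilation step splits the N tokens into r^2 interleaved groups of N/r^2 tokens each, so the pairwise-attention cost drops from N^2 to N^2/r^2, which is consistent with the abstract's claim of reduced quadratic complexity.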
Pages: 12