DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Cited by: 0
Authors
Li, Ke [1]
Wang, Di [1]
Liu, Gang [1]
Zhu, Wenxuan [1]
Zhong, Haodi [1]
Wang, Quan [1]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation;
DOI
10.1016/j.neunet.2024.106653
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, the Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their ability to capture global visual dependencies through self-attention. However, the cost of global self-attention grows quadratically with the number of tokens, which is especially expensive for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce the cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention in diagonal regions at hybrid scales within each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to distant tokens at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with 40% less model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA modules. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
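The fine-near, coarse-far idea described in the abstract can be illustrated with a small toy example. The sketch below is not the DiagSWin implementation: it works on a 1D token sequence rather than diagonal-shaped 2D windows, uses mean pooling to form coarse tokens, and omits learned query/key/value projections. The function names and the parameters `radius` and `pool` are assumptions made for illustration only.

```python
import math

def build_key_groups(n, i, radius, pool):
    """Return the index groups that token i attends to (hypothetical helper).

    Tokens within `radius` of i form singleton groups (fine granularity).
    Tokens farther away are merged into contiguous chunks of up to `pool`
    tokens (coarse granularity).
    """
    lo, hi = max(0, i - radius), min(n, i + radius + 1)
    groups = [[j] for j in range(lo, hi)]
    # Chunk the left and right remainders separately so chunks stay contiguous.
    for start, stop in ((0, lo), (hi, n)):
        for s in range(start, stop, pool):
            groups.append(list(range(s, min(s + pool, stop))))
    return groups

def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]

def attend(tokens, i, radius, pool):
    """Scaled dot-product attention for token i over fine and coarse keys.

    Simplification (assumption): the raw token vectors serve as queries,
    keys, and values; a real transformer would apply learned projections.
    """
    groups = build_key_groups(len(tokens), i, radius, pool)
    keys = [mean_vec([tokens[j] for j in g]) for g in groups]
    q = tokens[i]
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    out = [sum(w * k[c] for w, k in zip(weights, keys)) for c in range(d)]
    return out, weights, groups

# Toy input: 64 two-dimensional tokens.
tokens = [[math.sin(0.1 * j), math.cos(0.1 * j)] for j in range(64)]
out, weights, groups = attend(tokens, i=32, radius=2, pool=8)
print(len(groups), "key groups instead of", len(tokens))
```

With these toy settings, token 32 scores 13 key groups (5 fine, 8 coarse) rather than all 64 tokens, while every token in the sequence still contributes to the output through some group. This is the sense in which a mixed-granularity scheme can keep long-range context at reduced cost.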
Pages: 14
Related papers
50 in total
  • [21] Multi-scale Knowledge Transfer Vision Transformer for 3D vessel shape segmentation
    Hua, Michael J.
    Wu, Junjie
    Zhong, Zichun
    COMPUTERS & GRAPHICS-UK, 2024, 122
  • [22] CascadeMedSeg: integrating pyramid vision transformer with multi-scale fusion for precise medical image segmentation
    Li, Junwei
    Sun, Shengfeng
    Li, Shijie
    Xia, Ruixue
    SIGNAL IMAGE AND VIDEO PROCESSING, 2024, 18 (12) : 9067 - 9079
  • [23] Global and Local Multi-scale Feature Fusion for Object Detection and Semantic Segmentation
    Lim, Young-Chul
    Kang, Minsung
    2019 30TH IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV19), 2019, : 2557 - 2562
  • [24] Oil spill detection: SAR multi-scale segmentation & object features evaluation
    Topouzelis, K
    Karathanassi, V
    Pavlakis, P
    Rokos, D
    REMOTE SENSING OF THE OCEAN AND SEA ICE 2002, 2002, 4880 : 77 - 87
  • [25] Data-efficient multi-scale fusion vision transformer
    Tang, Hao
    Liu, Dawei
    Shen, Chengchao
    PATTERN RECOGNITION, 2025, 161
  • [26] Fusion multi-scale Transformer skin lesion segmentation algorithm
    Liang L.-M.
    Zhou L.-S.
    Yin J.
    Sheng X.-Q.
    Jilin Daxue Xuebao (Gongxueban)/Journal of Jilin University (Engineering and Technology Edition), 2024, 54 (04) : 1086 - 1098
  • [27] Multi-scale nested UNet with transformer for colorectal polyp segmentation
    Wang, Zenan
    Liu, Zhen
    Yu, Jianfeng
    Gao, Yingxin
    Liu, Ming
    JOURNAL OF APPLIED CLINICAL MEDICAL PHYSICS, 2024, 25 (06):
  • [28] Multi-Scale Object Detection by Clustering Lines
    Ommer, Bjoern
    Malik, Jitendra
    2009 IEEE 12TH INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2009, : 484 - 491
  • [29] Feature Enhancement for Multi-scale Object Detection
    Zheng, Huicheng
    Chen, Jiajie
    Chen, Lvran
    Li, Ye
    Yan, Zhiwei
    NEURAL PROCESSING LETTERS, 2020, 51 (02) : 1907 - 1919
  • [30] Selective Multi-scale Learning for Object Detection
    Chen, Junliang
    Lu, Weizeng
    Shen, Linlin
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT II, 2021, 12892 : 3 - 14