ReViT: Enhancing vision transformers feature diversity with attention residual connections

Cited by: 3
|
Authors
Diko, Anxhelo [1 ]
Avola, Danilo [1 ]
Cascio, Marco [1 ,2 ]
Cinque, Luigi [1 ]
Affiliations
[1] Sapienza Univ Rome, Dept Comp Sci, Via Salaria 113, I-00198 Rome, Italy
[2] Univ Rome UnitelmaSapienza, Dept Law & Econ, Piazza Sassari 4, I-00161 Rome, Italy
Keywords
Vision transformer; Feature collapse; Self-attention mechanism; Residual attention learning; Visual recognition
DOI
10.1016/j.patcog.2024.110853
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
The Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify elements within an image and increase the accuracy and robustness of vision-based recognition systems. Following this rationale, we propose a novel residual attention learning method for improving ViT-based architectures, increasing their visual feature diversity and model robustness. In this way, the proposed network can capture and preserve significant low-level features, providing more details about the elements within the scene being analyzed. The effectiveness and robustness of the presented method are evaluated on five image classification benchmarks, including ImageNet1k, CIFAR10, CIFAR100, Oxford Flowers-102, and Oxford-IIIT Pet, achieving improved performance. Additionally, experiments on the COCO2017 dataset show that the devised approach discovers and incorporates semantic and spatial relationships for object detection and instance segmentation when implemented into spatial-aware transformer models.
Pages: 13
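To make the abstract's core idea concrete, below is a minimal PyTorch-style sketch of residual attention learning: each layer blends its own attention map with the map propagated from the previous layer, so early, low-level attention patterns are not lost in depth. The class name `ResidualAttention`, the blending weight `alpha`, and the convex-combination scheme are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of attention residual connections across transformer layers.
# Assumed, not taken from the paper: ResidualAttention, alpha, and the blending rule.
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    def __init__(self, dim, num_heads=8, alpha=0.5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # alpha balances the current layer's attention with the propagated one
        self.alpha = alpha

    def forward(self, x, prev_attn=None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # attention scores
        attn = attn.softmax(dim=-1)
        if prev_attn is not None:
            # residual connection on the attention maps themselves:
            # keep part of the previous layer's attention distribution
            attn = self.alpha * attn + (1.0 - self.alpha) * prev_attn
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn


if __name__ == "__main__":
    # Chain blocks and carry the attention map from layer to layer.
    x = torch.randn(2, 197, 192)                        # (batch, tokens, dim)
    blocks = nn.ModuleList(ResidualAttention(192, num_heads=3) for _ in range(4))
    attn = None
    for blk in blocks:
        x, attn = blk(x, attn)
    print(x.shape)                                      # torch.Size([2, 197, 192])
```

Because both attention maps are row-stochastic, their convex combination remains a valid attention distribution; the design choice is simply how strongly earlier layers' attention should persist into deeper ones.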