Adaptive search for broad attention based vision transformers

Cited by: 0
Authors
Li, Nannan [1 ,2 ,3 ]
Chen, Yaran [1 ,2 ]
Zhao, Dongbin [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Systems, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Tsinghua Univ, Dept Automat, BNRIST, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Vision transformer; Adaptive architecture search; Broad search space; Image classification; Broad learning;
DOI
10.1016/j.neucom.2024.128696
CLC classification number
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Vision Transformers (ViTs) have recently prevailed in computer vision tasks owing to their powerful image-representation capability. Frustratingly, manually designing efficient ViT architectures is laborious, often involving repetitive trial and error. Furthermore, the exploration of lightweight ViTs remains limited, resulting in inferior performance compared with convolutional neural networks. To tackle these challenges, we propose Adaptive Search for Broad attention based Vision Transformers (ASB), which automates the design of efficient ViT architectures by combining a broad search space with an adaptive evolutionary algorithm. The broad search space facilitates the exploration of a novel connection paradigm, enabling more comprehensive integration of attention information to improve ViT performance. In addition, the adaptive evolutionary algorithm explores architectures efficiently by dynamically learning the probability distribution of candidate operators. Our experiments demonstrate that the adaptive evolution in ASB efficiently learns excellent lightweight models, converging 55% faster than a traditional evolutionary algorithm. The effectiveness of ASB is further validated across several visual tasks: on ImageNet classification, the searched model attains 77.8% accuracy with 6.5M parameters, outperforming state-of-the-art models including EfficientNet and EfficientViT; on mobile COCO panoptic segmentation it delivers 43.7% PQ; and on mobile ADE20K semantic segmentation it attains 40.9% mIoU. The code and pre-trained models will be available soon in ASB-Code.
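The adaptive evolutionary search described in the abstract, which learns a probability distribution over candidate operators and samples architectures from it, can be illustrated with a minimal sketch. The operator names, layer count, fitness proxy, and update rule below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of an evolutionary search that adapts
# a per-layer categorical distribution over candidate operators.
import random

CANDIDATE_OPS = ["broad_attn", "local_attn", "mlp", "identity"]  # hypothetical operator set
NUM_LAYERS = 4        # depth of the searched block (assumption)
POP_SIZE = 16         # architectures sampled per generation
GENERATIONS = 30
LEARNING_RATE = 0.2   # how fast the sampling distribution adapts

def toy_fitness(arch):
    """Stand-in for the validation accuracy of a trained sub-network."""
    # Reward a (made-up) preferred operator so the toy search converges.
    return sum(op == "broad_attn" for op in arch) + random.gauss(0, 0.1)

# Start from a uniform categorical distribution for every layer.
probs = [[1.0 / len(CANDIDATE_OPS)] * len(CANDIDATE_OPS) for _ in range(NUM_LAYERS)]

for gen in range(GENERATIONS):
    # Sample a population of architectures from the current distribution.
    population = [
        [random.choices(CANDIDATE_OPS, weights=probs[l])[0] for l in range(NUM_LAYERS)]
        for _ in range(POP_SIZE)
    ]
    scored = sorted(population, key=toy_fitness, reverse=True)
    elites = scored[: POP_SIZE // 4]  # keep the top quarter

    # Pull each layer's distribution toward the operator frequencies of the elites.
    for l in range(NUM_LAYERS):
        counts = [sum(a[l] == op for a in elites) for op in CANDIDATE_OPS]
        total = sum(counts)
        for i in range(len(CANDIDATE_OPS)):
            target = counts[i] / total
            probs[l][i] = (1 - LEARNING_RATE) * probs[l][i] + LEARNING_RATE * target

best = max(population, key=toy_fitness)
print("best sampled architecture:", best)
```

In the paper's setting, the fitness would be the evaluated accuracy of a sampled ViT sub-network within the broad search space; biasing sampling toward high-fitness operators is the kind of distribution learning to which the abstract attributes the 55% faster convergence over a traditional evolutionary algorithm.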
Pages: 12
Related papers
50 items in total
  • [31] Arjun; Rajpoot, Aniket Singh; Panicker, Mahesh Raveendranatha. Introducing Attention Mechanism for EEG Signals: Emotion Recognition with Vision Transformers. 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2021: 5723-5726.
  • [32] Zhang, Shuoxi; Liu, Hanpeng; Lin, Stephen; He, Kun. You Only Need Less Attention at Each Stage in Vision Transformers. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 6057-6066.
  • [33] Diko, Anxhelo; Avola, Danilo; Cascio, Marco; Cinque, Luigi. ReViT: Enhancing vision transformers feature diversity with attention residual connections. Pattern Recognition, 2024, 156.
  • [34] Zhang, Qiming; Xu, Yufei; Zhang, Jing; Tao, Dacheng. VSA: Learning Varied-Size Window Attention in Vision Transformers. Computer Vision, ECCV 2022, Pt XXV, 2022, 13685: 466-483.
  • [35] Sahiner, Arda; Ergen, Tolga; Ozturkler, Batu; Pauly, John; Mardani, Morteza; Pilanci, Mert. Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. International Conference on Machine Learning, Vol 162, 2022: 19050-19088.
  • [36] Peng, Fei; Meng, Shao-hua; Long, Min. Presentation attack detection based on two-stream vision transformers with self-attention fusion. Journal of Visual Communication and Image Representation, 2022, 85.
  • [37] Aghamohammadesmaeilketabforoosh, Kimia; Nikan, Soodeh; Antonini, Giorgio; Pearce, Joshua M. Optimizing Strawberry Disease and Quality Detection with Vision Transformers and Attention-Based Convolutional Neural Networks. Foods, 2024, 13 (12).
  • [38] Gou, Chao; Zhou, Yuchen; Li, Dan. Driver attention prediction based on convolution and transformers. The Journal of Supercomputing, 2022, 78 (06): 8268-8284.
  • [40] Grainger, Ryan; Paniagua, Thomas; Song, Xi; Cuntoor, Naresh; Lee, Mun Wai; Wu, Tianfu. PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 18568-18578.