LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Cited by: 429
Authors
Graham, Ben
El-Nouby, Alaaeldin
Touvron, Hugo
Stock, Pierre
Joulin, Armand
Jegou, Herve
Douze, Matthijs
DOI
10.1109/ICCV48922.2021.01204
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
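The attention bias mentioned in the abstract injects positional information directly into the attention logits: instead of adding positional embeddings to the tokens, a learned per-offset scalar is added to each query–key score before the softmax, making the bias translation-invariant across the feature map. A minimal single-head NumPy sketch of this idea (the function name, toy grid, and random weights are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def attention_with_bias(q, k, v, bias_table, rel_index):
    """Single-head attention where a learned bias, indexed by the
    relative offset between query and key positions, is added to the
    content logits (illustrative sketch of the attention-bias idea)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)        # (N, N) content scores
    logits = logits + bias_table[rel_index]  # (N, N) positional bias
    # Softmax over the key axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy 2x2 grid of tokens; each (|dy|, |dx|) offset indexes one
# learned scalar in a small per-head bias table.
H = W = 2
coords = np.array([(y, x) for y in range(H) for x in range(W)])
rel = np.abs(coords[:, None, :] - coords[None, :, :])  # (N, N, 2)
rel_index = rel[..., 0] * W + rel[..., 1]              # (N, N) flat offsets

rng = np.random.default_rng(0)
N, d = H * W, 8
q, k, v = rng.standard_normal((3, N, d))
bias_table = rng.standard_normal(H * W)  # one scalar per offset

out = attention_with_bias(q, k, v, bias_table, rel_index)
print(out.shape)  # (4, 8)
```

Because the bias depends only on relative offsets, the table stays small (one entry per distinct offset per head) and the same weights apply at every spatial location.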
Pages: 12239 - 12249
Page count: 11
Related Papers
49 records in total
  • [1] LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation
    Xu, Guoping
    Zhang, Xuan
    He, Xinwei
    Wu, Xinglong
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VIII, 2024, 14432 : 42 - 53
  • [2] Multi-tailed vision transformer for efficient inference
    Wang, Yunke
    Du, Bo
    Wang, Wenyuan
    Xu, Chang
    NEURAL NETWORKS, 2024, 174
  • [3] ViTA: A Vision Transformer Inference Accelerator for Edge Applications
    Nag, Shashank
    Datta, Gourav
    Kundu, Souvik
    Chandrachoodan, Nitin
    Beerel, Peter A.
    2023 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS, 2023,
  • [4] RESEARCH ON IMAGE RECOGNITION OF ETHNIC MINORITY CLOTHING BASED ON IMPROVED VISION TRANSFORMER
    Wang, Taishen
    Wen, Bin
    MATHEMATICAL FOUNDATIONS OF COMPUTING, 2024, 7 (01): : 84 - 97
  • [5] MDP: Model Decomposition and Parallelization of Vision Transformer for Distributed Edge Inference
    Wang, Weiyan
    Zhang, Yiming
    Jin, Yilun
    Tian, Han
    Chen, Li
    2023 19TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING, MSN 2023, 2023, : 570 - 578
  • [6] Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks
    Chen, Tong
    Liu, Sicong
    Chen, Zhiran
    Hu, Wenyan
    Chen, Dachi
    Wang, Yuanxin
    Lyu, Qi
    Le, Cindy X.
    Wang, Wenping
    ADVANCES IN ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, 2023, 3 (03): : 1369 - 1388
  • [7] Towards Efficient Vision Transformer Inference: A First Study of Transformers on Mobile Devices
    Wang, Xudong
    Zhang, Li Lyna
    Wang, Yang
    Yang, Mao
    PROCEEDINGS OF THE 2022 THE 23RD ANNUAL INTERNATIONAL WORKSHOP ON MOBILE COMPUTING SYSTEMS AND APPLICATIONS (HOTMOBILE '22), 2022, : 1 - 7
  • [8] Game Robot's Vision Based on Faster RCNN
    Liu, Shuai
    Zheng, Bin
    Zhao, Yongting
    Guo, Bin
    2018 CHINESE AUTOMATION CONGRESS (CAC), 2018, : 2472 - 2476
  • [9] Removing Rows and Columns of Tokens in Vision Transformer Enables Faster Dense Prediction Without Retraining
    Su, Diwei
    Fei, Cheng
    Luo, Jianxu
    COMPUTER VISION - ECCV 2024, PT LXXII, 2025, 15130 : 325 - 341
  • [10] Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning
    Jie, Shibo
    Tang, Yehui
    Guo, Jianyuan
    Deng, Zhi-Hong
    Han, Kai
    Wang, Yunhe
    COMPUTER VISION - ECCV 2024, PT XVI, 2025, 15074 : 76 - 94