LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Cited by: 429
Authors
Graham, Ben
El-Nouby, Alaaeldin
Touvron, Hugo
Stock, Pierre
Joulin, Armand
Jegou, Herve
Douze, Matthijs
DOI
10.1109/ICCV48922.2021.01204
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
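The attention bias mentioned in the abstract injects positional information directly into the attention logits: instead of adding positional embeddings to the tokens, a learned per-offset scalar is added to each query–key score before the softmax, making the bias translation-invariant across the feature map. A minimal single-head NumPy sketch of this idea (the function name, toy grid, and random weights are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def attention_with_bias(q, k, v, bias_table, rel_index):
    """Single-head attention where a learned bias, indexed by the
    relative offset between query and key positions, is added to the
    content logits (illustrative sketch of the attention-bias idea)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)        # (N, N) content scores
    logits = logits + bias_table[rel_index]  # (N, N) positional bias
    # Softmax over the key axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy 2x2 grid of tokens; each (|dy|, |dx|) offset indexes one
# learned scalar in a small per-head bias table.
H = W = 2
coords = np.array([(y, x) for y in range(H) for x in range(W)])
rel = np.abs(coords[:, None, :] - coords[None, :, :])  # (N, N, 2)
rel_index = rel[..., 0] * W + rel[..., 1]              # (N, N) flat offsets

rng = np.random.default_rng(0)
N, d = H * W, 8
q, k, v = rng.standard_normal((3, N, d))
bias_table = rng.standard_normal(H * W)  # one scalar per offset

out = attention_with_bias(q, k, v, bias_table, rel_index)
print(out.shape)  # (4, 8)
```

Because the bias depends only on relative offsets, the table stays small (one entry per distinct offset per head) and the same weights apply at every spatial location.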
Pages: 12239 - 12249
Page count: 11
Related Papers
49 records in total
  • [1] LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation
    Xu, Guoping
    Zhang, Xuan
    He, Xinwei
    Wu, Xinglong
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VIII, 2024, 14432 : 42 - 53
  • [2] Multi-tailed vision transformer for efficient inference
    Wang, Yunke
    Du, Bo
    Wang, Wenyuan
    Xu, Chang
    NEURAL NETWORKS, 2024, 174
  • [3] ViTA: A Vision Transformer Inference Accelerator for Edge Applications
    Nag, Shashank
    Datta, Gourav
    Kundu, Souvik
    Chandrachoodan, Nitin
    Beerel, Peter A.
    2023 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS, 2023,
  • [4] RESEARCH ON IMAGE RECOGNITION OF ETHNIC MINORITY CLOTHING BASED ON IMPROVED VISION TRANSFORMER
    Wang, Taishen
    Wen, Bin
    MATHEMATICAL FOUNDATIONS OF COMPUTING, 2024, 7 (01): : 84 - 97
  • [5] MDP: Model Decomposition and Parallelization of Vision Transformer for Distributed Edge Inference
    Wang, Weiyan
    Zhang, Yiming
    Jin, Yilun
    Tian, Han
    Chen, Li
    2023 19TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING, MSN 2023, 2023, : 570 - 578
  • [6] Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks
    Chen, Tong
    Liu, Sicong
    Chen, Zhiran
    Hu, Wenyan
    Chen, Dachi
    Wang, Yuanxin
    Lyu, Qi
    Le, Cindy X.
    Wang, Wenping
    ADVANCES IN ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, 2023, 3 (03): : 1369 - 1388
  • [7] Towards Efficient Vision Transformer Inference: A First Study of Transformers on Mobile Devices
    Wang, Xudong
    Zhang, Li Lyna
    Wang, Yang
    Yang, Mao
    PROCEEDINGS OF THE 2022 THE 23RD ANNUAL INTERNATIONAL WORKSHOP ON MOBILE COMPUTING SYSTEMS AND APPLICATIONS (HOTMOBILE '22), 2022, : 1 - 7
  • [8] Game Robot's Vision Based on Faster RCNN
    Liu, Shuai
    Zheng, Bin
    Zhao, Yongting
    Guo, Bin
    2018 CHINESE AUTOMATION CONGRESS (CAC), 2018, : 2472 - 2476
  • [9] Removing Rows and Columns of Tokens in Vision Transformer Enables Faster Dense Prediction Without Retraining
    Su, Diwei
    Fei, Cheng
    Luo, Jianxu
    COMPUTER VISION - ECCV 2024, PT LXXII, 2025, 15130 : 325 - 341
  • [10] Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning
    Jie, Shibo
    Tang, Yehui
    Guo, Jianyuan
    Deng, Zhi-Hong
    Han, Kai
    Wang, Yunhe
    COMPUTER VISION - ECCV 2024, PT XVI, 2025, 15074 : 76 - 94