LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Cited by: 429
Authors
Graham, Ben
El-Nouby, Alaaeldin
Touvron, Hugo
Stock, Pierre
Joulin, Armand
Jegou, Herve
Douze, Matthijs
DOI
10.1109/ICCV48922.2021.01204
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
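The attention bias mentioned in the abstract replaces explicit positional embeddings: each attention head learns one bias per relative (x, y) offset between query and key positions, and that bias is added to the attention logits before the softmax. The Python sketch below illustrates the idea only; it is a minimal reconstruction under assumed names and shapes (AttentionWithBias, dim, num_heads, resolution), not the released code at the repository linked in the abstract.

# Minimal illustrative sketch, NOT the authors' released implementation.
# A learnable bias per head, indexed by the relative offset between query
# and key positions, is added to the attention logits in place of explicit
# positional embeddings. Class name, shapes, and arguments are assumptions.
import torch
import torch.nn as nn


class AttentionWithBias(nn.Module):
    def __init__(self, dim, num_heads, resolution):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

        # Enumerate the resolution x resolution grid of token positions and
        # map every (query, key) pair to the index of its absolute offset.
        points = [(x, y) for x in range(resolution) for y in range(resolution)]
        offset_to_idx = {}
        idxs = []
        for p1 in points:
            for p2 in points:
                offset = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
                if offset not in offset_to_idx:
                    offset_to_idx[offset] = len(offset_to_idx)
                idxs.append(offset_to_idx[offset])
        # One learnable scalar per head and per distinct offset.
        self.attention_biases = nn.Parameter(torch.zeros(num_heads, len(offset_to_idx)))
        self.register_buffer(
            "bias_idxs", torch.LongTensor(idxs).view(len(points), len(points))
        )

    def forward(self, x):  # x: (batch, N, dim) with N == resolution ** 2
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]               # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn + self.attention_biases[:, self.bias_idxs]  # positional bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Because the bias depends only on the relative offset between grid positions, the learned table holds one entry per head and per distinct offset rather than one per query-key pair, which keeps it small and cheap to apply at inference time.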
Pages: 12239 - 12249
Number of pages: 11
Related Papers
49 items in total (items [41] - [49] shown below)
  • [41] A video content description algorithm based on S-YOLO V5 and Vision Transformer
    Xu, Peng
    Li, Tiezhu
    Zhi, Baoping
    印刷与数字媒体技术研究 (Printing and Digital Media Technology Study), 2023, (04) : 212 - 222
  • [42] Computer-aided diagnosis of Alzheimer's disease and neurocognitive disorders with multimodal Bi-Vision Transformer (BiViT)
    Shah, S. Muhammad Ahmed Hassan
    Khan, Muhammad Qasim
    Rizwan, Atif
    Jan, Sana Ullah
    Samee, Nagwan Abdel
    Jamjoom, Mona M.
    PATTERN ANALYSIS AND APPLICATIONS, 2024, 27 (03)
  • [43] EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture
    Dong, Peiyan
    Zhuang, Jinming
    Yang, Zhuoping
    Ji, Shixin
    Li, Yanyu
    Xu, Dongkuan
    Huang, Heng
    Hu, Jingtong
    Jones, Alex K.
    Shi, Yiyu
    Wang, Yanzhi
    Zhou, Peipei
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2024, 43 (11) : 3949 - 3960
  • [44] ViTAD: Leveraging modified vision transformer for Alzheimer's disease multi-stage classification from brain MRI scans
    Joy, Md. Ashif Mahmud
    Nasrin, Shamima
    Siddiqua, Ayesha
    Farid, Dewan Md.
    BRAIN RESEARCH, 2025, 1847
  • [45] SMIL-DeiT: Multiple Instance Learning and Self-supervised Vision Transformer network for Early Alzheimer's disease classification
    Yin, Yue
    Jin, Weikang
    Bai, Jing
    Liu, Ruotong
    Zhen, Haowei
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [46] Explainable Vision Transformer with Self-Supervised Learning to Predict Alzheimer's Disease Progression Using 18F-FDG PET
    Khatri, Uttam
    Kwon, Goo-Rak
    BIOENGINEERING-BASEL, 2023, 10 (10)
  • [47] OViTAD: Optimized Vision Transformer to Predict Various Stages of Alzheimer's Disease Using Resting-State fMRI and Structural MRI Data
    Sarraf, Saman
    Sarraf, Arman
    DeSouza, Danielle D. D.
    Anderson, John A. E.
    Kabia, Milton
    BRAIN SCIENCES, 2023, 13 (02)
  • [48] FGI-CogViT: Fuzzy Granule-based Interpretable Cognitive Vision Transformer for Early Detection of Alzheimer's Disease using MRI Scan Images
    Pramanik, Anima
    Sarker, Soumick
    Sarkar, Sobhan
    Bose, Indranil
    INFORMATION SYSTEMS FRONTIERS, 2024,
  • [49] Enhancing accuracy in Barrett's surveillance using artificial intelligence: A multimodal (white-light and narrow-band imaging) model comparing vision transformer and convolutional neural networks
    Tan, J. L.
    Pitawela, D.
    Chinnaratha, A.
    Chen, H-T
    Carneiro, G.
    Singh, R.
    JOURNAL OF GASTROENTEROLOGY AND HEPATOLOGY, 2023, 38 : 251 - 251