MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Cited by: 1
Authors
Huo, Hua [1]
Li, Bingjie [1]
Affiliations
[1] Henan Univ Sci & Technol, Informat Engn Coll, Luoyang 471000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
action recognition; multi-granularity multi-scale fusion; vision transformer; efficiency
DOI
10.3390/electronics13050948
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Video-based action recognition is developing rapidly. Although Vision Transformers (ViT) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNN) and related models perform exceptionally well in video action recognition. However, high computational cost and large memory consumption remain issues that cannot be ignored, and current research focuses on finding effective methods to improve model performance and overcome these limits. We therefore present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed for action recognition in videos, that reduces computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and we introduce multi-granularity on top of multi-scale, which allows the number of tokens entering the next computational step to be chosen selectively, thereby reducing redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. These mechanisms are combined to optimize the allocation of resources in the model, further emphasizing critical information and reducing redundancy, thereby minimizing computational cost. To assess the proposed approach, we conduct comprehensive experiments on benchmark datasets in the action recognition domain. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
Pages: 16
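The abstract describes two token-reduction ideas: selectively choosing how many tokens enter the next step, and fusing tokens with lower information content into a shorter sequence. The following is only a minimal illustrative sketch of that general idea, not the authors' implementation. The function name `reduce_tokens`, the per-token `scores`, and the choice of averaging as the fusion rule are all assumptions made here; the abstract does not say how tokens are scored or fused.

```python
from typing import List, Sequence

Token = List[float]


def reduce_tokens(tokens: Sequence[Token],
                  scores: Sequence[float],
                  keep: int) -> List[Token]:
    """Keep the `keep` highest-scoring tokens unchanged (fine granularity)
    and average all remaining tokens into one coarse token.

    Hypothetical helper: the scoring and the averaging rule are
    assumptions for illustration, not taken from the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])   # restore original token order
    rest_idx = ranked[keep:]

    reduced = [list(tokens[i]) for i in kept_idx]
    if rest_idx:
        dim = len(tokens[0])
        # Fuse the low-score tokens into a single coarse token (mean).
        coarse = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx)
                  for d in range(dim)]
        reduced.append(coarse)
    return reduced


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]
reduced = reduce_tokens(tokens, scores, keep=2)
print(len(tokens), "->", len(reduced))
```

In this toy run the two highest-scoring tokens pass through unchanged and the other two are merged, so the sequence shrinks from four tokens to three. Since self-attention cost grows quadratically with sequence length, shortening the sequence in this way is one plausible route to the cost savings the abstract claims.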