Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer

Cited by: 4
Authors
Mou, Yuting [1]
Jiang, Xinghao [1]
Xu, Ke [1]
Sun, Tanfeng [1]
Wang, Zepeng [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Natl Engn Lab Informat Content Anal Tech, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Compressed video; action recognition; NETWORK; EFFICIENCY;
DOI
10.1109/TCSVT.2023.3319140
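Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract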
Compressed video action recognition reduces decoding and inference time compared with recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different types of frames (I-frames and P-frames). I-frames, like RGB frames, are rich in frame information, but the redundant information may interfere with the recognition task. P-frames contain two modalities, residual (R) and motion vector (MV). Although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability between modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context with local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluated the proposed DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
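As an illustration of the two-stream split described in the abstract, the following minimal sketch routes frames of a compressed sequence to a long-span I-frame stream and a short-span P-frame stream. It is not the authors' implementation: the `Frame` class, the `make_gop_sequence` and `split_streams` helpers, and the fixed group-of-pictures size are all hypothetical names and assumptions introduced here, and the sketch covers only the routing step, not DAM, FPE, or the fusion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """Hypothetical frame record: position in the video and coding type."""
    index: int
    kind: str  # "I" (intra-coded) or "P" (predicted; carries MV and residual)


def make_gop_sequence(num_frames: int, gop_size: int) -> List[Frame]:
    """Assumed layout: every gop_size-th frame is an I-frame, the rest are P-frames."""
    return [Frame(i, "I" if i % gop_size == 0 else "P") for i in range(num_frames)]


def split_streams(frames: List[Frame]) -> Tuple[List[int], List[int]]:
    """Route I-frame indices to the long-span stream and P-frame indices to the short-span stream."""
    i_stream = [f.index for f in frames if f.kind == "I"]
    p_stream = [f.index for f in frames if f.kind == "P"]
    return i_stream, p_stream


frames = make_gop_sequence(num_frames=12, gop_size=4)
i_stream, p_stream = split_streams(frames)
print(i_stream)  # [0, 4, 8]
print(p_stream)  # [1, 2, 3, 5, 6, 7, 9, 10, 11]
```

In this toy layout the I-frame stream is sparse and spans the whole clip, while the P-frame stream is dense within each group of pictures, which mirrors the long-span and short-span roles the abstract assigns to the two streams.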
Pages: 3299-3312
Page count: 14