Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer

Cited by: 4
Authors
Mou, Yuting [1]
Jiang, Xinghao [1]
Xu, Ke [1]
Sun, Tanfeng [1]
Wang, Zepeng [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Natl Engn Lab Informat Content Anal Tech, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Compressed video; action recognition; NETWORK; EFFICIENCY;
DOI
10.1109/TCSVT.2023.3319140
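Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract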
Compressed video action recognition reduces decoding and inference time compared with recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different types of frames (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but their redundant information may interfere with the recognition task. P-frames contain two modalities, residual (R) and motion vector (MV). Although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability of the different modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluated the proposed DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
Pages: 3299-3312
Page count: 14