Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer

Cited: 4
Authors
Mou, Yuting [1 ]
Jiang, Xinghao [1 ]
Xu, Ke [1 ]
Sun, Tanfeng [1 ]
Wang, Zepeng [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Natl Engn Lab Informat Content Anal Tech, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Compressed video; action recognition; NETWORK; EFFICIENCY;
DOI
10.1109/TCSVT.2023.3319140
CLC Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Subject Classification
0808; 0809;
Abstract
Compressed-video action recognition offers the advantage of reduced decoding and inference time compared with working in the RGB domain. However, the compressed domain poses unique challenges because it contains different frame types (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in information, but their redundancy may interfere with the recognition task. P-frames contain two modalities, residuals (R) and motion vectors (MV); although they carry less information, they reflect motion cues. To address these challenges and exploit the complementary information in the different frame types and modalities, we propose a novel approach called the Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frame stream captures temporal information. We propose a Dual-Modal Attention Module (DAM) to mine the variability between the P-frame modalities and complement the orthogonal feature vectors. In addition, given the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frame stream extracts the global context of the entire video, including content and scene information. By fusing the global video context with local key-frame features, our model represents actions at both fine and coarse granularity. We evaluated the proposed DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our model achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
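The abstract's two-stream design can be sketched in code. The toy below is an illustrative reading of the idea only, not the authors' implementation: one modality of the P-frame stream (motion vectors) attends over the other (residuals), the component already explained by the attended feature is subtracted to keep a complementary part, and the pooled P-frame feature is late-fused with a pooled I-frame global feature by concatenation. All function names, shapes, and the projection-based "complement" are assumptions for illustration.

```python
import numpy as np

def cross_modal_attention(mv_feat, res_feat):
    """Toy dual-modal attention: motion-vector tokens attend over residual
    tokens; the part of each MV token already explained by the attended
    residual feature is removed, leaving a complementary component.
    Shapes and the orthogonalization step are illustrative assumptions."""
    d = mv_feat.shape[-1]
    # Scaled dot-product attention scores: (N_mv, N_res).
    scores = mv_feat @ res_feat.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    attended = weights @ res_feat                           # (N_mv, d)
    # Projection coefficient of each MV token onto its attended feature.
    proj = (np.sum(mv_feat * attended, axis=-1, keepdims=True)
            / (np.sum(attended * attended, axis=-1, keepdims=True) + 1e-8))
    # Keep the residual (near-orthogonal) component of the MV token.
    return mv_feat - proj * attended

def fuse_streams(i_frame_global, p_frame_local):
    """Toy late fusion of the long-span I-frame context (coarse) with the
    short-span P-frame motion feature (fine) by concatenation."""
    return np.concatenate([i_frame_global, p_frame_local], axis=-1)

# Toy usage with random features.
rng = np.random.default_rng(0)
mv = rng.normal(size=(4, 8))     # 4 motion-vector tokens, dim 8
res = rng.normal(size=(4, 8))    # 4 residual tokens, dim 8
p_local = cross_modal_attention(mv, res).mean(axis=0)  # pooled P-stream feature
i_global = rng.normal(size=(8,))                       # pooled I-stream feature
fused = fuse_streams(i_global, p_local)
print(fused.shape)  # (16,)
```

Concatenation is only one possible fusion; the point of the sketch is that the two streams stay independent until a final fusion step, mirroring the fine-grained/coarse-grained split described above.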
Pages: 3299 - 3312
Number of pages: 14
Related Papers
50 records total
  • [21] A Dual-Stream Transformer With Diff-Attention for Multispectral and Panchromatic Classification
    Xu, Lin
    Zhu, Hao
    Jiao, Licheng
    Zhao, Wenhao
    Li, Xiaotong
    Hou, Biao
    Ren, Zhongle
    Ma, Wenping
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61 : 1 - 14
  • [22] A DUAL-STREAM NEUROANATOMY OF SINGING
    Loui, Psyche
    MUSIC PERCEPTION, 2015, 32 (03): : 232 - 241
  • [23] Dual-stream VO: Visual Odometry Based on LSTM Dual-Stream Convolutional Neural Network
    Luo, Yuan
    Zeng, YongChao
    Lv, RunZhe
    Wang, WenHao
    ENGINEERING LETTERS, 2022, 30 (03) : 926 - 934
  • [24] Video salient object detection using dual-stream spatiotemporal attention
    Xu, Chenchu
    Gao, Zhifan
    Zhang, Heye
    Li, Shuo
    de Albuquerque, Victor Hugo C.
    APPLIED SOFT COMPUTING, 2021, 108
  • [25] The motion vector reuse algorithm to improve dual-stream video encoder
    Zhou, Hong
    Zhou, Jingli
    Xia, Xiaojian
    ICSP: 2008 9TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, VOLS 1-5, PROCEEDINGS, 2008, : 1284 - 1287
  • [26] Bayesian Cellular Automata Fusion Model Based on Dual-Stream Strategy for Video Anomaly Action Detection
    Zhao, Zhongtang
    Li, Ruixian
    PATTERN RECOGNITION AND IMAGE ANALYSIS, 2021, 31 (04) : 688 - 698
  • [28] Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval
    Li, Pandeng
    Xie, Hongtao
    Ge, Jiannan
    Zhang, Lei
    Min, Shaobo
    Zhang, Yongdong
    COMPUTER VISION - ECCV 2022, PT XIV, 2022, 13674 : 181 - 197
  • [29] Dual-stream spatio-temporal decoupling network for video deblurring
    Ning, Taigong
    Li, Weihong
    Li, Zhenghao
    Zhang, Yanfang
    APPLIED SOFT COMPUTING, 2022, 116
  • [30] Dual-stream cross-modal fusion alignment network for survival analysis
    Song, Jinmiao
    Hao, Yatong
    Zhao, Shuang
    Zhang, Peng
    Feng, Qilin
    Dai, Qiguo
    Duan, Xiaodong
    BRIEFINGS IN BIOINFORMATICS, 2025, 26 (02)