Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

Cited by: 73

Authors:

Ahsan, Unaiza [1]

Madhok, Rishi [2]

Essa, Irfan [1]

Affiliations:

[1] Georgia Inst Technol, Atlanta, GA 30332 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2019

DOI:

10.1109/WACV.2019.00025

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronics and Communication Technology]

Discipline classification codes:

0808; 0809

Abstract:

We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32] but a combination of the two requires extensive preprocessing such as tracking objects through millions of video frames [59] or computing optical flow to determine frame regions with high motion [30]. We propose to combine spatial and temporal context in one self-supervised framework without any heavy preprocessing. We divide multiple video frames into grids of patches and train a network to solve jigsaw puzzles on these patches from multiple frames. So the network is trained to correctly identify the position of a patch within a video frame as well as the position of a patch over time. We also propose a novel permutation strategy that outperforms random permutations while significantly reducing computational and memory constraints. We use our trained network for transfer learning tasks such as video activity recognition and demonstrate the strength of our approach on two benchmark video action recognition datasets without using a single frame from these datasets for unsupervised pretraining of our proposed video jigsaw network.
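The puzzle construction described in the abstract can be illustrated with a minimal sketch. It splits several frames into a grid of patches, pools the patches across frames, and shuffles them so that a network could be trained to recover each patch's position in space and in time. The function names `make_video_jigsaw` and `decode_position`, the 2x2 grid, and the uniformly random permutation are assumptions made for illustration; the abstract does not specify the grid size, and the authors' own permutation strategy is not reproduced here.

```python
import numpy as np


def make_video_jigsaw(frames, grid=2, rng=None):
    """Build one shuffled spatiotemporal puzzle (illustrative sketch).

    frames: array of shape (T, H, W, C) holding T video frames.
    Returns (shuffled_patches, perm), where perm[i] is the original
    index of the patch now sitting at shuffled position i.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames, height, width, _ = frames.shape
    ph, pw = height // grid, width // grid  # patch height and width

    # Original order: frame-major, then row, then column.
    patches = np.stack([
        frames[f, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw]
        for f in range(num_frames)
        for r in range(grid)
        for col in range(grid)
    ])

    # Assumption: a uniformly random permutation, NOT the paper's strategy.
    perm = rng.permutation(len(patches))
    return patches[perm], perm


def decode_position(index, grid=2):
    """Map an original patch index back to (frame, row, col)."""
    frame, cell = divmod(index, grid * grid)
    row, col = divmod(cell, grid)
    return frame, row, col


if __name__ == "__main__":
    clip = np.random.default_rng(0).random((3, 64, 64, 3))  # 3 dummy frames
    shuffled, perm = make_video_jigsaw(clip, grid=2)
    print(shuffled.shape)            # 12 patches of 32x32x3
    print(decode_position(perm[0]))  # where the first shuffled patch came from
```

In a setup like this, the training target would be derived from `perm` (for example, its index within a fixed set of allowed permutations), so the network has to reason about both the within-frame and the across-frame position of every patch.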

Citation

Pages: 179-189

Page count: 11

