Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

Cited by: 73

Authors:

Ahsan, Unaiza [1]

Madhok, Rishi [2]

Essa, Irfan [1]

Affiliations:

[1] Georgia Inst Technol, Atlanta, GA 30332 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2019

Keywords:

DOI:

10.1109/WACV.2019.00025

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32], but a combination of the two requires extensive preprocessing, such as tracking objects through millions of video frames [59] or computing optical flow to determine frame regions with high motion [30]. We propose to combine spatial and temporal context in one self-supervised framework without any heavy preprocessing. We divide multiple video frames into grids of patches and train a network to solve jigsaw puzzles on these patches from multiple frames. The network is thus trained to correctly identify the position of a patch within a video frame as well as the position of a patch over time. We also propose a novel permutation strategy that outperforms random permutations while significantly reducing computational and memory requirements. We use our trained network for transfer learning tasks such as video activity recognition and demonstrate the strength of our approach on two benchmark video action recognition datasets without using a single frame from these datasets for unsupervised pretraining of our proposed video jigsaw network.
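The puzzle construction described in the abstract (cut several frames into grids of patches, shuffle the patches across space and time, and ask a network to identify the shuffle) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 2x2 grid, the 3-frame clip, the 100-entry permutation set, and the helper names `frames_to_patches` and `make_puzzle`. The abstract does not describe the proposed permutation strategy, so plain random permutations stand in for it.

```python
import numpy as np


def frames_to_patches(frames, grid=2):
    """Split frames of shape (T, H, W, C) into grid x grid patches per frame.

    Returns an array of shape (T * grid * grid, H // grid, W // grid, C),
    ordered by frame, then patch row, then patch column.
    """
    t, h, w, c = frames.shape
    ph, pw = h // grid, w // grid
    # Crop so that height and width divide evenly by the grid size.
    frames = frames[:, :ph * grid, :pw * grid]
    patches = frames.reshape(t, grid, ph, grid, pw, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)  # (T, row, col, ph, pw, C)
    return patches.reshape(t * grid * grid, ph, pw, c)


def make_puzzle(frames, permutations, rng, grid=2):
    """Shuffle the patches with one permutation drawn from a fixed set.

    The index of the chosen permutation is the classification label that a
    network would be trained to predict from the shuffled patches.
    """
    patches = frames_to_patches(frames, grid)
    label = int(rng.integers(len(permutations)))
    return patches[permutations[label]], label


rng = np.random.default_rng(0)
frames = rng.random((3, 64, 64, 3))  # hypothetical 3-frame RGB clip
n_patches = 3 * 2 * 2                # 3 frames, 2x2 grid each
# Assumption: a fixed set of random permutations, not the paper's strategy.
permutations = np.stack([rng.permutation(n_patches) for _ in range(100)])
shuffled, label = make_puzzle(frames, permutations, rng)
```

Because the patches from all frames are pooled before shuffling, recovering the permutation requires placing each patch both within its frame and along the time axis, which is the joint spatial and temporal reasoning the abstract refers to.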


Pages: 179-189

Page count: 11

50 items in total

[1] Collaborative Spatiotemporal Feature Learning for Video Action Recognition
Li, Chao
Zhong, Qiaoyong
Xie, Di
Pu, Shiliang
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 7864 - 7873
[2] Spatiotemporal Saliency Representation Learning for Video Action Recognition
Kong, Yongqiang
Wang, Yunhong
Li, Annan
IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1515 - 1528
[3] Spatiotemporal Residual Networks for Video Action Recognition
Feichtenhofer, Christoph
Pinz, Axel
Wildes, Richard P.
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
[4] Spatiotemporal Fusion Networks for Video Action Recognition
Liu, Zheng
Hu, Haifeng
Zhang, Junxuan
NEURAL PROCESSING LETTERS, 2019, 50 (02) : 1877 - 1890
[5] Spatiotemporal Pyramid Network for Video Action Recognition
Wang, Yunbo
Long, Mingsheng
Wang, Jianmin
Yu, Philip S.
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 2097 - 2106
[6] Spatiotemporal Fusion Networks for Video Action Recognition
Zheng Liu
Haifeng Hu
Junxuan Zhang
Neural Processing Letters, 2019, 50 : 1877 - 1890
[7] Spatiotemporal Relation Networks for Video Action Recognition
Liu, Zheng
Hu, Haifeng
IEEE ACCESS, 2019, 7 : 14969 - 14976
[8] Spatiotemporal Multiplier Networks for Video Action Recognition
Feichtenhofer, Christoph
Pinz, Axel
Wildes, Richard P.
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 7445 - 7454
[9] Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw
Huo, Yuqi
Ding, Mingyu
Lu, Haoyu
Huang, Ziyuan
Tang, Mingqian
Lu, Zhiwu
Xiang, Tao
PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 751 - 757
[10] Unsupervised Deep Learning of Mid-Level Video Representation for Action Recognition
Hou, Jingyi
Wu, Xinxiao
Chen, Jin
Luo, Jiebo
Jia, Yunde
THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 6910 - 6917
