Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

Cited by: 73
Authors
Ahsan, Unaiza [1]
Madhok, Rishi [2]
Essa, Irfan [1]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
DOI
10.1109/WACV.2019.00025
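Chinese Library Classification (CLC) codes
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809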
Abstract
We propose a self-supervised learning method that jointly reasons about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32], but combining the two has required extensive preprocessing, such as tracking objects through millions of video frames [59] or computing optical flow to determine frame regions with high motion [30]. We combine spatial and temporal context in one self-supervised framework without any heavy preprocessing. We divide multiple video frames into grids of patches and train a network to solve jigsaw puzzles on these patches from multiple frames. The network is thus trained to identify both the position of a patch within a video frame and its position over time. We also propose a novel permutation strategy that outperforms random permutations while significantly reducing computational and memory requirements. We use the trained network for transfer learning tasks such as video activity recognition and demonstrate the strength of our approach on two benchmark video action recognition datasets, without using a single frame from these datasets for unsupervised pretraining of the video jigsaw network.
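As a rough illustration of the patch-shuffling idea described in the abstract, the following is a minimal sketch, assuming NumPy and frames given as (H, W, C) arrays. All function names here are hypothetical and not taken from the paper. The permutation set is drawn uniformly at random purely as a placeholder: the abstract does not describe the proposed permutation strategy, so it is not reproduced. The network that would predict the permutation label is also omitted.

```python
import numpy as np


def split_into_patches(frame, grid):
    """Split an (H, W, C) frame into grid*grid equal patches in row-major order."""
    h, w = frame.shape[0] // grid, frame.shape[1] // grid
    return [
        frame[r * h:(r + 1) * h, c * w:(c + 1) * w]
        for r in range(grid)
        for c in range(grid)
    ]


def random_permutation_set(num_patches, num_perms, rng):
    """Placeholder permutation set: uniformly random orderings.

    Assumption: this stands in for the paper's permutation strategy,
    which the abstract does not specify.
    """
    return np.stack([rng.permutation(num_patches) for _ in range(num_perms)])


def make_video_jigsaw_sample(frames, grid, permutations, rng):
    """Build one puzzle: patches from all frames, shuffled by one permutation.

    Returns the shuffled patch stack and the index of the permutation used,
    which would serve as the classification target in a jigsaw pretext task.
    """
    # Patches are ordered frame by frame, so an index encodes both
    # the spatial position within a frame and the temporal position.
    patches = [p for f in frames for p in split_into_patches(f, grid)]
    label = int(rng.integers(len(permutations)))
    order = permutations[label]
    shuffled = np.stack([patches[i] for i in order])
    return shuffled, label
```

With three frames and a 2 x 2 grid, each puzzle contains 12 patches, and recovering the permutation requires placing each patch both within its frame and along the time axis.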
Pages: 179-189
Page count: 11
Related papers
50 in total
  • [1] Collaborative Spatiotemporal Feature Learning for Video Action Recognition
    Li, Chao
    Zhong, Qiaoyong
    Xie, Di
    Pu, Shiliang
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 7864 - 7873
  • [2] Spatiotemporal Saliency Representation Learning for Video Action Recognition
    Kong, Yongqiang
    Wang, Yunhong
    Li, Annan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1515 - 1528
  • [3] Spatiotemporal Residual Networks for Video Action Recognition
    Feichtenhofer, Christoph
    Pinz, Axel
    Wildes, Richard P.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
  • [4] Spatiotemporal Fusion Networks for Video Action Recognition
    Liu, Zheng
    Hu, Haifeng
    Zhang, Junxuan
    NEURAL PROCESSING LETTERS, 2019, 50 (02) : 1877 - 1890
  • [5] Spatiotemporal Pyramid Network for Video Action Recognition
    Wang, Yunbo
    Long, Mingsheng
    Wang, Jianmin
    Yu, Philip S.
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 2097 - 2106
  • [6] Spatiotemporal Fusion Networks for Video Action Recognition
    Zheng Liu
    Haifeng Hu
    Junxuan Zhang
    Neural Processing Letters, 2019, 50 : 1877 - 1890
  • [7] Spatiotemporal Relation Networks for Video Action Recognition
    Liu, Zheng
    Hu, Haifeng
    IEEE ACCESS, 2019, 7 : 14969 - 14976
  • [8] Spatiotemporal Multiplier Networks for Video Action Recognition
    Feichtenhofer, Christoph
    Pinz, Axel
    Wildes, Richard P.
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 7445 - 7454
  • [9] Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw
    Huo, Yuqi
    Ding, Mingyu
    Lu, Haoyu
    Huang, Ziyuan
    Tang, Mingqian
    Lu, Zhiwu
    Xiang, Tao
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 751 - 757
  • [10] Unsupervised Deep Learning of Mid-Level Video Representation for Action Recognition
    Hou, Jingyi
    Wu, Xinxiao
    Chen, Jin
    Luo, Jiebo
    Jia, Yunde
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 6910 - 6917