Learning Spatio-Temporal Representation with Local and Global Diffusion

Cited by: 107
Authors
Qiu, Zhaofan [1]
Yao, Ting [2]
Ngo, Chong-Wah [3]
Tian, Xinmei [1]
Mei, Tao [2]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] JD AI Res, Beijing, Peoples R China
[3] City Univ Hong Kong, Kowloon, Hong Kong, Peoples R China
DOI
10.1109/CVPR.2019.01233
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Convolutional Neural Networks (CNNs) are regarded as a powerful class of models for visual recognition problems. Nevertheless, the convolutional filters in these networks are local operations that ignore long-range dependencies. This drawback is even more severe for video recognition, since video is an information-intensive medium with complex temporal variations. In this paper, we present a novel framework that boosts spatio-temporal representation learning through Local and Global Diffusion (LGD). Specifically, we construct a novel neural network architecture that learns local and global representations in parallel. The architecture is composed of LGD blocks, where each block updates the local and global features by modeling the diffusions between these two representations. The diffusions let the two aspects of information, localized and holistic, interact effectively, which yields more powerful representation learning. Furthermore, a kernelized classifier is introduced to combine the representations from the two aspects for video recognition. Our LGD networks achieve clear improvements on the large-scale Kinetics-400 and Kinetics-600 video classification datasets, outperforming the best competitors by 3.5% and 0.7%, respectively. We further examine how well both the global and local representations produced by our pre-trained LGD networks generalize on four different benchmarks for video action recognition and spatio-temporal action detection. Superior performance over several state-of-the-art techniques is reported on these benchmarks.
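The abstract does not state the update equations, so the following is only an illustrative sketch of what one block that diffuses information between a local and a global representation might look like. Everything in it is an assumption made for illustration: the function names (`lgd_block`, `run_demo`), the use of a per-position linear map as a stand-in for a convolution, mean pooling as the local-to-global step, the ReLU nonlinearity, the random weights, and the initialization of the global vector by pooling the local features. It is not the authors' implementation, and it omits the kernelized classifier.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def mean_pool(local):
    """Average the per-position feature vectors into one global vector."""
    n = len(local)
    return [sum(col) / n for col in zip(*local)]


def random_matrix(rng, rows, cols, scale=0.5):
    """Hypothetical random weights; a real network would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def lgd_block(local, global_vec, W_local, W_g2l, W_l2g, W_g2g):
    """One assumed local/global diffusion step.

    local:      list of per-position feature vectors (the localized view)
    global_vec: a single feature vector (the holistic view)
    """
    # Global -> local diffusion: project the global vector and add it
    # to the transformed features at every position.
    g_to_l = matvec(W_g2l, global_vec)
    new_local = [relu(add(matvec(W_local, x), g_to_l)) for x in local]

    # Local -> global diffusion: pool the updated local features and
    # combine them with the previous global vector.
    pooled = mean_pool(new_local)
    new_global = relu(add(matvec(W_l2g, pooled), matvec(W_g2g, global_vec)))
    return new_local, new_global


def run_demo(num_positions=6, channels=4, num_blocks=2, seed=0):
    """Stack a few blocks on random input and return both representations."""
    rng = random.Random(seed)
    local = [[rng.uniform(-1.0, 1.0) for _ in range(channels)]
             for _ in range(num_positions)]
    # Assumption: start the global path from the pooled local features.
    global_vec = mean_pool(local)
    for _ in range(num_blocks):
        weights = [random_matrix(rng, channels, channels) for _ in range(4)]
        local, global_vec = lgd_block(local, global_vec, *weights)
    return local, global_vec
```

Calling `run_demo()` returns the local feature vectors (one per position) and a single global vector after two stacked blocks. The point of the sketch is only that the two representations are kept side by side and each update of one reads from the other.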
Pages: 12048 - 12057
Number of pages: 10