Action Keypoint Network for Efficient Video Recognition

Cited by: 3
Authors
Chen, Xu [1 ,2 ]
Han, Yahong [1 ,2 ,3 ]
Wang, Xiaohan [4 ]
Sun, Yifan [5 ]
Yang, Yi [4 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[2] Tianjin Univ, Tianjin Key Lab Machine Learning, Tianjin 300072, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310000, Peoples R China
[5] Baidu Res, Beijing 100000, Peoples R China
Keywords
Video recognition; space-time interest points; deep learning; point cloud
DOI
10.1109/TIP.2022.3191461
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods focus on either temporal or spatial selection independently, neglecting the reality that redundancy is usually spatial and temporal simultaneously. Moreover, their selected content is usually cropped with fixed shapes (e.g., temporally-cropped frames, spatially-cropped patches), while the realistic distribution of informative content can be much more diverse. With these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of "action keypoints" and then transforms video recognition into point cloud classification. More concretely, AK-Net has two steps, i.e., keypoint selection and point cloud classification. First, it inputs the video into a baseline network and outputs a feature map from an intermediate layer. We view each pixel on this feature map as a spatial-temporal point and select some informative keypoints using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Since the video is represented as a 1D sequence after the specified layer, AK-Net transforms the subsequent layers into a point cloud classification sub-net by compacting the original 2D convolutional kernels into 1D kernels. Consequently, AK-Net brings two-fold benefits for efficiency: the keypoint selection step collects informative content within arbitrary shapes and increases the efficiency of modeling spatial-temporal dependencies, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
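To make the two-step pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the idea, not the authors' released implementation: a 1x1-convolution scoring head stands in for the paper's self-attention selector, an ascending-index sort stands in for its ranking criterion, and every shape and hyper-parameter (8 frames, 256 channels, 128 keypoints, 174 classes) is an illustrative assumption.

import torch
import torch.nn as nn

class KeypointSelector(nn.Module):
    """Scores every spatio-temporal feature-map position and keeps the top-k.
    The 1x1-conv scoring head is a stand-in for the paper's self-attention."""
    def __init__(self, channels: int, num_keypoints: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-position saliency
        self.num_keypoints = num_keypoints

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (T, C, H, W) -- one intermediate feature map per sampled frame
        t, c, h, w = feat.shape
        scores = self.score(feat).flatten()               # (T*H*W,) saliency per point
        points = feat.permute(0, 2, 3, 1).reshape(-1, c)  # (T*H*W, C) point set
        topk = scores.topk(self.num_keypoints).indices
        # Stand-in ranking criterion: sort the selected indices so the 1D
        # sequence follows temporal-then-raster order.
        order = topk.sort().values
        return points[order]                              # (K, C) ordered keypoints

class PointCloudHead(nn.Module):
    """Classifies the ordered keypoint sequence with 1D convolutions,
    mirroring the abstract's compaction of 2D kernels into 1D kernels."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        x = keypoints.t().unsqueeze(0)  # (1, C, K) sequence for Conv1d
        x = self.conv(x).squeeze(-1)    # (1, C) pooled descriptor
        return self.fc(x)               # (1, num_classes) logits

# Hypothetical usage: 8 frames of 256-channel 14x14 features, 128 keypoints.
feat = torch.randn(8, 256, 14, 14)
keypoints = KeypointSelector(256, 128)(feat)
logits = PointCloudHead(256, num_classes=174)(keypoints)

The efficiency argument is visible in the shapes: the classification head convolves over only 128 selected points instead of all 8 x 14 x 14 = 1568 positions, and each 1D kernel touches k points rather than a k x k spatial window.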
Pages: 4980-4993
Page count: 14
Related Papers
50 items in total (items [21]-[30] shown)
  • [21] Sparse Dense Transformer Network for Video Action Recognition
    Qu, Xiaochun
    Zhang, Zheyuan
    Xiao, Wei
    Ran, Jinye
    Wang, Guodong
    Zhang, Zili
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, 2022, 13369 : 43 - 56
  • [22] Badminton video action recognition based on time network
    Zhi, Juncai
    Sun, Zijie
    Zhang, Ruijie
    Zhao, Zhouxiang
    JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING, 2023, 23 (05) : 2739 - 2752
  • [23] Motion Complementary Network for Efficient Action Recognition
    Cheng, Ke
    Zhang, Yifan
    Li, Chenghua
    Cheng, Jian
    Lu, Hanqing
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1543 - 1549
  • [24] Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling
    Hachiuma, Ryo
    Sato, Fumiaki
    Sekii, Taiki
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22962 - 22971
  • [25] Accurate Grid Keypoint Learning for Efficient Video Prediction
    Gao, Xiaojie
    Jin, Yueming
    Dou, Qi
    Fu, Chi-Wing
    Heng, Pheng-Ann
    2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2021, : 5908 - 5915
  • [26] Temporal Saliency Query Network for Efficient Video Recognition
    Xia, Boyang
    Wang, Zhihao
    Wu, Wenhao
    Wang, Haoran
    Han, Jungong
    COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 741 - 759
  • [27] Towards efficient video-based action recognition: context-aware memory attention network
    Koh, Thean Chun
    Yeo, Chai Kiat
    Jing, Xuan
    Sivadas, Sunil
    SN APPLIED SCIENCES, 2023, 5 (12)
  • [29] An efficient motion visual learning method for video action recognition
    Wang, Bin
    Chang, Faliang
    Liu, Chunsheng
    Wang, Wenqian
    Ma, Ruiyi
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 255
  • [30] Efficient dual attention SlowFast networks for video action recognition
    Wei, Dafeng
    Tian, Ye
    Wei, Liqing
    Zhong, Hong
    Chen, Siqian
    Pu, Shiliang
    Lu, Hongtao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2022, 222