Action Keypoint Network for Efficient Video Recognition

Cited by: 3
Authors
Chen, Xu [1 ,2 ]
Han, Yahong [1 ,2 ,3 ]
Wang, Xiaohan [4 ]
Sun, Yifan [5 ]
Yang, Yi [4 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[2] Tianjin Univ, Tianjin Key Lab Machine Learning, Tianjin 300072, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310000, Peoples R China
[5] Baidu Res, Beijing 100000, Peoples R China
Keywords
Video recognition; space-time interest points; deep learning; point cloud;
DOI
10.1109/TIP.2022.3191461
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods perform either temporal or spatial selection independently, overlooking the reality that redundancy is usually both spatial and temporal. Moreover, their selected content is usually cropped with fixed shapes (e.g., temporally cropped frames, spatially cropped patches), whereas the realistic distribution of informative content can be far more diverse. Motivated by these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects informative points scattered in arbitrary-shaped regions as a set of "action keypoints" and then transforms video recognition into point cloud classification. More concretely, AK-Net has two steps, i.e., keypoint selection and point cloud classification. First, it feeds the video into a baseline network and extracts a feature map from an intermediate layer. Each pixel on this feature map is viewed as a spatio-temporal point, and the informative keypoints are selected using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Since the video is represented by a 1D sequence after the specified layer, AK-Net transforms the subsequent layers into a point cloud classification sub-net by compacting the original 2D convolutional kernels into 1D kernels. Consequently, AK-Net brings two-fold benefits for efficiency: the keypoint selection step collects informative content within arbitrary shapes and increases the efficiency of modeling spatial-temporal dependencies, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
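The two-step pipeline in the abstract can be illustrated with a minimal NumPy sketch. This is a simplification, not the paper's implementation: here the informativeness score is a dot product with the global mean feature (standing in for learned self-attention), and the ranking criterion is simply the original spatio-temporal index order; all function names are illustrative.

```python
import numpy as np

def select_action_keypoints(feat, k):
    """Step 1: pick k informative spatio-temporal points from a feature map.

    feat: intermediate feature map of shape (T, H, W, C).
    Returns an ordered (k, C) keypoint sequence.
    """
    T, H, W, C = feat.shape
    points = feat.reshape(-1, C)           # every pixel is one candidate point
    # Simplified informativeness score: attention of each point to the global
    # mean feature (AK-Net uses learned self-attention instead).
    query = points.mean(axis=0)
    scores = points @ query / np.sqrt(C)
    top = np.argsort(scores)[-k:]          # indices of the top-k points
    # Ranking criterion: arrange keypoints into an ordered 1D sequence,
    # here by their original spatio-temporal index.
    top = np.sort(top)
    return points[top]

def conv1d_classify(seq, w1d):
    """Step 2: point cloud classification via 1D convolution over the
    keypoint sequence, standing in for the compacted (2D -> 1D) kernels."""
    k, C = seq.shape
    c_out, c_in, ksize = w1d.shape
    out = np.zeros((k - ksize + 1, c_out))
    for i in range(len(out)):
        window = seq[i:i + ksize]          # (ksize, C) local neighborhood
        out[i] = np.einsum('oik,ki->o', w1d, window)
    return out.mean(axis=0)                # global average pooling -> logits

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 7, 7, 16))  # T=8 frames, 7x7 spatial, C=16
keypts = select_action_keypoints(feat, k=64)
w = rng.standard_normal((4, 16, 3)) * 0.1  # 4 classes, 1D kernel size 3
logits = conv1d_classify(keypts, w)
print(keypts.shape, logits.shape)          # -> (64, 16) (4,)
```

Note how the selected points may come from any frame and any position, so the covered region is arbitrary-shaped rather than a fixed crop, and the subsequent layers only process k points instead of T*H*W.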
Pages: 4980-4993
Page count: 14
Related papers
50 in total
  • [31] Dynamic Spatial Focus for Efficient Compressed Video Action Recognition
    Zheng, Ziwei
    Yang, Le
    Wang, Yulin
    Zhang, Miao
    He, Lijun
    Huang, Gao
    Li, Fan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (02) : 695 - 708
  • [32] TWO-PATHWAY TRANSFORMER NETWORK FOR VIDEO ACTION RECOGNITION
    Jiang, Bo
    Yu, Jiahong
    Zhou, Lei
    Wu, Kailin
    Yang, Yang
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1089 - 1093
  • [33] Manet: motion-aware network for video action recognition
    Li, Xiaoyang
    Yang, Wenzhu
    Wang, Kanglin
    Wang, Tiebiao
    Zhang, Chen
    COMPLEX & INTELLIGENT SYSTEMS, 2025, 11 (03)
  • [34] Multi-Kernel Excitation Network for Video Action Recognition
    Tian, Qingze
    Wang, Kun
    Liu, Baodi
    Wang, Yanjiang
    2022 16TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP2022), VOL 1, 2022, : 155 - 159
  • [35] Multipath Attention and Adaptive Gating Network for Video Action Recognition
    Zhang, Haiping
    Hu, Zepeng
    Yu, Dongjin
    Guan, Liming
    Liu, Xu
    Ma, Conghao
    NEURAL PROCESSING LETTERS, 2024, 56 (02)
  • [36] SCN: Dilated silhouette convolutional network for video action recognition
    Hua, Michelle
    Gao, Mingqi
    Zhong, Zichun
    COMPUTER AIDED GEOMETRIC DESIGN, 2021, 85
  • [37] A Multi-Scale Video Longformer Network for Action Recognition
    Chen, Congping
    Zhang, Chunsheng
    Dong, Xin
    APPLIED SCIENCES-BASEL, 2024, 14 (03):
  • [39] SDAN: Stacked Diverse Attention Network for Video Action Recognition
    Zhu, Xiaoguang
    Huang, Siran
    Fan, Wenjing
    Cheng, Yuhao
    Shao, Huaqing
    Liu, Peilin
    2021 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2021
  • [40] ATTENTIONAL FUSED TEMPORAL TRANSFORMATION NETWORK FOR VIDEO ACTION RECOGNITION
    Yang, Ke
    Wang, Zhiyuan
    Dai, Huadong
    Shen, Tianlong
    Qiao, Peng
    Niu, Xin
    Jiang, Jie
    Li, Dongsheng
    Dou, Yong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4377 - 4381