Hierarchical Temporal Pooling for Efficient Online Action Recognition

Cited: 0
Authors
Zhang, Can [1 ]
Zou, Yuexian [1 ,2 ]
Chen, Guang [1 ]
Affiliations
[1] Peking Univ, Sch ECE, ADSPLAB, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
Source
MULTIMEDIA MODELING (MMM 2019), PT I | 2019, Vol. 11295
Keywords
Action recognition; Hierarchical Temporal Pooling; Real-time;
DOI
10.1007/978-3-030-05710-7_39
CLC classification
TP [Automation and computer technology]
Subject classification
0812
Abstract
Action recognition in videos is a challenging task. Recently developed deep learning-based action recognition methods have achieved state-of-the-art performance on several benchmarks. However, these methods are inefficient: their large model sizes and long runtimes restrict practical applications. In this study, we focus on improving both the accuracy and the efficiency of action recognition in the two-stream ConvNet framework by investigating effective video-level representations. Our motivation stems from two observations: adjacent video frames contain substantial redundant information, and humans do not recognize actions from frame-level features alone. To extract effective video-level features, we propose a Hierarchical Temporal Pooling (HTP) module and develop a two-stream action recognition network, termed HTP-Net (Two-stream), which is carefully designed to obtain effective video-level representations by hierarchically incorporating temporal motion and spatial appearance features. It is worth noting that all two-stream methods that take optical flow as one of their inputs are computationally inefficient, since computing optical flow is time-consuming. To improve efficiency, we also consider a variant that takes only raw RGB frames as input, termed HTP-Net (RGB). Extensive experiments on two benchmarks, UCF101 and HMDB51, demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance, while HTP-Net (RGB) offers competitive accuracy and is approximately one to two orders of magnitude faster than other state-of-the-art single-stream action recognition methods.
Specifically, our HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU, which enables real-time action recognition and is of great value in practical applications.
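To make the core idea concrete, the sketch below illustrates hierarchical temporal pooling in its simplest form: per-frame feature vectors are merged level by level, max-pooling each pair of adjacent time steps, until a single video-level vector remains. This is a toy NumPy illustration of the general principle, not the paper's HTP module (which fuses learned motion and appearance features inside a ConvNet); the function name and pairwise max-pooling choice are assumptions for exposition.

```python
import numpy as np

def hierarchical_temporal_pooling(frame_features):
    """Toy hierarchical temporal pooling (illustrative, not the paper's module).

    Repeatedly max-pools adjacent pairs of time steps, halving the temporal
    length at each level, until one video-level vector remains.

    frame_features: array of shape (T, D) -- T frames, D-dimensional features.
    Returns an array of shape (D,).
    """
    feats = np.asarray(frame_features, dtype=float)
    while feats.shape[0] > 1:
        if feats.shape[0] % 2 == 1:
            # Odd number of time steps: duplicate the last frame so pairs line up.
            feats = np.concatenate([feats, feats[-1:]], axis=0)
        # Group adjacent time steps into pairs and pool each pair.
        pairs = feats.reshape(-1, 2, feats.shape[1])
        feats = pairs.max(axis=1)
    return feats[0]

# Toy example: 4 frames with 3-dimensional features.
x = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [2., 1., 0.],
              [1., 2., 4.]])
v = hierarchical_temporal_pooling(x)  # -> [2., 3., 4.]
```

Because redundant adjacent frames collapse early in the hierarchy, the final vector summarizes the whole clip without scoring every frame independently, which is the efficiency argument the abstract makes.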
Pages: 471-482
Page count: 12
Related Papers (50 total)
  • [41] MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition
    Shan, Kaiyu
    Wang, Yongtao
    Tang, Zhi
    Chen, Ying
    Li, Yangyan
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1751 - 1756
  • [42] Human action recognition using weighted pooling
    Zhou, Wen
    Wang, Chunheng
    Xiao, Baihua
    Zhang, Zhong
    IET COMPUTER VISION, 2014, 8 (06) : 579 - 587
  • [43] A learnable motion preserving pooling for action recognition
    Li, Tankun
    Chan, Kwok Leung
    Tjahjadi, Tardi
    IMAGE AND VISION COMPUTING, 2024, 151
  • [44] Contextual Max Pooling for Human Action Recognition
    Zhang, Zhong
    Liu, Shuang
    Mei, Xing
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2015, E98D (04) : 989 - 993
  • [45] Online spatio-temporal action detection with adaptive sampling and hierarchical modulation
    Su, Shaowen
    Gan, Minggang
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [46] Mixed Resolution Network with hierarchical motion modeling for efficient action recognition
    Lu, Xiusheng
    Zhao, Sicheng
    Cheng, Lechao
    Zheng, Ying
    Fan, Xueqiao
    Song, Mingli
    KNOWLEDGE-BASED SYSTEMS, 2024, 294
  • [47] Online Action Recognition
    Suarez-Hernandez, Alejandro
    Segovia-Aguas, Javier
    Torras, Carme
    Alenya, Guillem
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 11981 - 11989
  • [48] Spatio-Temporal Motion Field Descriptors for The Hierarchical Action Recognition System
    Bao, Ruihan
    Shibata, Tadashi
    5TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATION SYSTEMS, ICSPCS'2011, 2011,
  • [49] Online Hierarchical Linking of Action Tubes for Spatio-Temporal Action Detection Based on Multiple Clues
    Su, Shaowen
    Zhang, Yan
    IEEE ACCESS, 2024, 12 : 54661 - 54672
  • [50] Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
    Pigou, Lionel
    van den Oord, Aaron
    Dieleman, Sander
    Van Herreweghe, Mieke
    Dambre, Joni
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2018, 126 (2-4) : 430 - 439