Hierarchical Temporal Pooling for Efficient Online Action Recognition

Cited: 0
Authors
Zhang, Can [1 ]
Zou, Yuexian [1 ,2 ]
Chen, Guang [1 ]
Affiliations
[1] Peking Univ, Sch ECE, ADSPLAB, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
Source
MULTIMEDIA MODELING (MMM 2019), PT I | 2019, Vol. 11295
Keywords
Action recognition; Hierarchical Temporal Pooling; Real-time;
DOI
10.1007/978-3-030-05710-7_39
CLC classification
TP [Automation and computer technology]
Subject classification
0812
Abstract
Action recognition in videos is a challenging task. Recently developed deep learning-based action recognition methods have achieved state-of-the-art performance on several benchmarks. However, these methods are inefficient: their large model sizes and long runtimes restrict practical applications. In this study, we focus on improving both the accuracy and the efficiency of action recognition in the two-stream ConvNet framework by investigating effective video-level representations. Our motivation stems from two observations: adjacent video frames contain substantial redundant information, and humans do not recognize actions from frame-level features alone. To extract effective video-level features, we propose a Hierarchical Temporal Pooling (HTP) module and develop a two-stream action recognition network, termed HTP-Net (Two-stream), which is carefully designed to obtain effective video-level representations by hierarchically incorporating temporal motion and spatial appearance features. It is worth noting that all two-stream methods that take optical flow as one of their inputs are computationally inefficient, since computing optical flow is time-consuming. To improve efficiency, we also consider a variant that takes only raw RGB frames as input, termed HTP-Net (RGB). Extensive experiments on two benchmarks, UCF101 and HMDB51, demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance, while HTP-Net (RGB) offers competitive accuracy and is approximately one to two orders of magnitude faster than other state-of-the-art single-stream action recognition methods.
Specifically, our HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU, which enables real-time action recognition and is of great value in practical applications.
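To make the core idea concrete, the sketch below illustrates hierarchical temporal pooling in its simplest form: per-frame feature vectors are merged level by level, max-pooling each pair of adjacent time steps, until a single video-level vector remains. This is a toy NumPy illustration of the general principle, not the paper's HTP module (which fuses learned motion and appearance features inside a ConvNet); the function name and pairwise max-pooling choice are assumptions for exposition.

```python
import numpy as np

def hierarchical_temporal_pooling(frame_features):
    """Toy hierarchical temporal pooling (illustrative, not the paper's module).

    Repeatedly max-pools adjacent pairs of time steps, halving the temporal
    length at each level, until one video-level vector remains.

    frame_features: array of shape (T, D) -- T frames, D-dimensional features.
    Returns an array of shape (D,).
    """
    feats = np.asarray(frame_features, dtype=float)
    while feats.shape[0] > 1:
        if feats.shape[0] % 2 == 1:
            # Odd number of time steps: duplicate the last frame so pairs line up.
            feats = np.concatenate([feats, feats[-1:]], axis=0)
        # Group adjacent time steps into pairs and pool each pair.
        pairs = feats.reshape(-1, 2, feats.shape[1])
        feats = pairs.max(axis=1)
    return feats[0]

# Toy example: 4 frames with 3-dimensional features.
x = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [2., 1., 0.],
              [1., 2., 4.]])
v = hierarchical_temporal_pooling(x)  # -> [2., 3., 4.]
```

Because redundant adjacent frames collapse early in the hierarchy, the final vector summarizes the whole clip without scoring every frame independently, which is the efficiency argument the abstract makes.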
Pages: 471-482
Page count: 12
Related Papers (50 total)
  • [41] MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition
    Shan, Kaiyu
    Wang, Yongtao
    Tang, Zhi
    Chen, Ying
    Li, Yangyan
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1751 - 1756
  • [42] Human action recognition using weighted pooling
    Zhou, Wen
    Wang, Chunheng
    Xiao, Baihua
    Zhang, Zhong
    IET COMPUTER VISION, 2014, 8 (06) : 579 - 587
  • [43] A learnable motion preserving pooling for action recognition
    Li, Tankun
    Chan, Kwok Leung
    Tjahjadi, Tardi
    IMAGE AND VISION COMPUTING, 2024, 151
  • [44] Contextual Max Pooling for Human Action Recognition
    Zhang, Zhong
    Liu, Shuang
    Mei, Xing
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2015, E98D (04) : 989 - 993
  • [45] Online spatio-temporal action detection with adaptive sampling and hierarchical modulation
    Su, Shaowen
    Gan, Minggang
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [46] Mixed Resolution Network with hierarchical motion modeling for efficient action recognition
    Lu, Xiusheng
    Zhao, Sicheng
    Cheng, Lechao
    Zheng, Ying
    Fan, Xueqiao
    Song, Mingli
    KNOWLEDGE-BASED SYSTEMS, 2024, 294
  • [47] Online Action Recognition
    Suarez-Hernandez, Alejandro
    Segovia-Aguas, Javier
    Torras, Carme
    Alenya, Guillem
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 11981 - 11989
  • [48] Spatio-Temporal Motion Field Descriptors for The Hierarchical Action Recognition System
    Bao, Ruihan
    Shibata, Tadashi
    5TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATION SYSTEMS, ICSPCS'2011, 2011,
  • [49] Online Hierarchical Linking of Action Tubes for Spatio-Temporal Action Detection Based on Multiple Clues
    Su, Shaowen
    Zhang, Yan
    IEEE ACCESS, 2024, 12 : 54661 - 54672
  • [50] Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
    Pigou, Lionel
    van den Oord, Aaron
    Dieleman, Sander
    Van Herreweghe, Mieke
    Dambre, Joni
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2018, 126 (2-4) : 430 - 439