Hierarchical Temporal Pooling for Efficient Online Action Recognition

Cited by: 0

Authors:

Zhang, Can [1]

Zou, Yuexian [1,2]

Chen, Guang [1]

Affiliations:

[1] Peking Univ, Sch ECE, ADSPLAB, Shenzhen, Peoples R China

[2] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

MULTIMEDIA MODELING (MMM 2019), PT I | 2019 / Volume 11295

Keywords:

Action recognition; Hierarchical Temporal Pooling; Real-time;

DOI:

10.1007/978-3-030-05710-7_39

Chinese Library Classification number:

TP [Automation technology, computer technology];

Subject classification code:

0812;

Abstract:

Action recognition in videos is a challenging task. Recently developed deep learning-based action recognition methods have achieved state-of-the-art performance on several action recognition benchmarks. However, these methods are inefficient: their large model size and long runtime restrict their practical application. In this study, we focus on improving the accuracy and efficiency of action recognition within the two-stream ConvNets framework by investigating effective video-level representations. Our motivation stems from the observation that redundant information is widespread in adjacent video frames and that humans do not recognize actions from frame-level features. Therefore, to extract effective video-level features, we propose a Hierarchical Temporal Pooling (HTP) module and develop a two-stream action recognition network, termed HTP-Net (Two-stream), which is designed to obtain effective video-level representations by hierarchically incorporating temporal motion and spatial appearance features. Notably, all two-stream action recognition methods that use optical flow as one of their inputs are computationally inefficient, since calculating optical flow is time-consuming. To improve efficiency, we consider using only raw RGB as input to HTP-Net, without optical flow; for clarity, this variant is termed HTP-Net (RGB). Extensive experiments have been conducted on two benchmarks, UCF101 and HMDB51. The results demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance, and that HTP-Net (RGB) offers competitive action recognition accuracy while being approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods. Specifically, HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU, which enables real-time action recognition and is of great value in practical applications.
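
The two throughput figures quoted in the abstract can be related with simple arithmetic. The sketch below is illustrative only: it assumes both figures describe the same HTP-Net (RGB) run, and the variable names are hypothetical and not taken from the paper. Under that assumption, it derives the implied number of frames processed per video and the average time spent per video.

```python
# Throughput figures quoted in the abstract for HTP-Net (RGB) on an NVIDIA Titan X GPU.
videos_per_second = 42
frames_per_second = 672

# Implied frames processed per video.
# Assumption: both figures were measured on the same run.
frames_per_video = frames_per_second / videos_per_second

# Implied average processing time per video, in milliseconds.
latency_ms_per_video = 1000.0 / videos_per_second

print(f"frames per video: {frames_per_video:g}")
print(f"average time per video: {latency_ms_per_video:.1f} ms")
```

This is a back-of-the-envelope reading of the reported numbers, not a statement about how the authors sample frames.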


Pages: 471-482

Number of pages: 12

