Spatiotemporal distilled dense-connectivity network for video action recognition

Cited by: 41

Authors:

Hao, Wangli [1,3]

Zhang, Zhaoxiang [1,2,3]

Affiliations:

[1] Chinese Acad Sci CASIA Beijing, Inst Automat, CRIPAC, NLPR, Beijing 100190, Peoples R China

[2] Ctr Excellence Brain Sci & Intelligence Technol C, Beijing 100190, Peoples R China

[3] Univ Chinese Acad Sci UCAS Beijing, Beijing 100190, Peoples R China

Source:

PATTERN RECOGNITION | 2019 / Volume 92

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Two-stream; Action recognition; Dense-connectivity; Knowledge distillation;

DOI:

10.1016/j.patcog.2019.03.005

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Two-stream convolutional neural networks show great promise for action recognition tasks. However, most two-stream approaches train the appearance and motion subnetworks independently, which may lead to a decline in performance because the two streams do not interact. To overcome this limitation, we propose a Spatiotemporal Distilled Dense-Connectivity Network (STDDCN) for video action recognition. This network implements both knowledge distillation and dense-connectivity (adapted from DenseNet). Using the STDDCN architecture, we aim to explore interaction strategies between the appearance and motion streams at different levels of the hierarchy. Specifically, block-level dense connections between the appearance and motion pathways enable spatiotemporal interaction at the feature representation layers. Moreover, knowledge distillation between the two streams (each treated as a student) and their final fusion (treated as the teacher) allows both streams to interact at the high-level layers. This architecture allows STDDCN to gradually obtain effective hierarchical spatiotemporal features, and it can be trained end-to-end. Finally, numerous ablation studies validate the effectiveness and generalization of our model on two benchmark datasets, UCF101 and HMDB51, on which the model also achieves promising performance. © 2019 Elsevier Ltd. All rights reserved.
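
The distillation idea in the abstract (two streams as students, their fusion as teacher) can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' implementation: the abstract does not state the loss, so the sketch uses the common temperature-softened KL-divergence form of knowledge distillation, fuses the streams by averaging their logits, and the names `fuse_logits`, `distillation_losses`, and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fuse_logits(appearance_logits, motion_logits):
    """Hypothetical late fusion: average the two streams' logits (an assumption)."""
    return [(a + m) / 2.0 for a, m in zip(appearance_logits, motion_logits)]


def distillation_losses(appearance_logits, motion_logits, temperature=2.0):
    """Return one distillation loss per student stream.

    The fused prediction plays the teacher; each stream is a student that is
    pulled toward the teacher's softened class distribution.
    """
    teacher = softmax(fuse_logits(appearance_logits, motion_logits), temperature)
    loss_appearance = kl_divergence(teacher, softmax(appearance_logits, temperature))
    loss_motion = kl_divergence(teacher, softmax(motion_logits, temperature))
    return loss_appearance, loss_motion


if __name__ == "__main__":
    # Toy logits for a 3-class problem (made-up numbers, not results).
    appearance = [2.0, 0.5, -1.0]
    motion = [0.0, 1.5, 0.5]
    print(distillation_losses(appearance, motion))
```

In a trainable version of this sketch, each loss would be added to the corresponding stream's classification loss, so that both streams receive a signal that depends on the other stream through the fused teacher.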

Pages: 13-24

Page count: 12
