D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

Cited by: 29
Authors
Jiang, Shengqin [1, 2]
Qi, Yuankai [3]
Zhang, Haokui [4]
Bai, Zongwen [5, 6]
Lu, Xiaobo [1, 2]
Wang, Peng [7]
Affiliations
[1] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[2] Minist Educ, Key Lab Measurement & Control Complex Syst Engn, Nanjing 210096, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Weihai 264209, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Shaanxi Key Lab Intelligent Proc Big Energy Data, Yanan 716000, Peoples R China
[6] Yanan Univ, Sch Phys & Elect Informat, Yanan 716000, Peoples R China
[7] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2170, Australia
Funding
National Natural Science Foundation of China
Keywords
Three-dimensional displays; Feature extraction; Convolution; Two dimensional displays; Streaming media; Kernel; Informatics; Three-dimensional convolutional neural networks (3D CNNs); action recognition; lightweight network; spatio-temporal information;
DOI
10.1109/TII.2020.3018487
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Three-dimensional convolutional neural networks (3D CNNs) have been explored to learn spatio-temporal information for video-based human action recognition. The high computational cost and memory demand of standard 3D CNNs, however, hinder their application in practical scenarios. In this article, we address these limitations by proposing a novel dual 3-D convolutional network (D3DNet) with two complementary lightweight branches. A coarse branch maintains a large temporal receptive field through a fast temporal downsampling strategy and simulates the expensive 3-D convolutions using a combination of more efficient spatial and temporal convolutions. Meanwhile, a fine branch progressively downsamples the video in the temporal domain and adopts 3-D convolutional units with reduced channel capacities to capture multiresolution spatio-temporal information. Instead of learning the two branches independently, the network shares a shallow spatio-temporal downsampling module between them for efficient low-level feature learning. In addition, lateral connections are learned to fuse the information from the two branches at multiple stages. The proposed network strikes a good balance between inference speed and action recognition performance. Based on RGB information only, it achieves competitive performance on five popular video-based action recognition datasets, with an inference speed of 3200 FPS on a single NVIDIA GTX 2080Ti card.
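To illustrate why replacing a full 3-D convolution with a spatial convolution followed by a temporal convolution can be cheaper, the sketch below compares weight counts for the two designs. It is only a back-of-the-envelope illustration, not the configuration used in D3DNet: the channel widths (64), the 3 x 3 x 3 kernel, the choice of an intermediate width equal to the output width, and the helper names `conv3d_params` and `factorized_params` are all assumptions made for this example.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a full kt x kh x kw 3-D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_mid, c_out, kt, kh, kw):
    """Weight count of a 1 x kh x kw spatial convolution followed by
    a kt x 1 x 1 temporal convolution (bias omitted)."""
    spatial = c_in * c_mid * kh * kw
    temporal = c_mid * c_out * kt
    return spatial + temporal


# Hypothetical layer sizes chosen for illustration only.
c_in, c_mid, c_out = 64, 64, 64
kt, kh, kw = 3, 3, 3

full = conv3d_params(c_in, c_out, kt, kh, kw)
fact = factorized_params(c_in, c_mid, c_out, kt, kh, kw)

print(f"full 3-D conv weights:   {full}")
print(f"factorized conv weights: {fact}")
print(f"reduction factor:        {full / fact:.2f}x")
```

Under these assumed sizes the factorized pair needs 12 weights per input-output channel pair (9 spatial plus 3 temporal) instead of 27, so the saving depends directly on the kernel size and on the intermediate channel width chosen.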
Pages: 4584-4593
Page count: 10