Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

Cited by: 0
Authors
Yang, Yuheng [1 ]
Chen, Haipeng [1 ]
Liu, Zhenguang [2 ]
Lyu, Yingda [3 ]
Zhang, Beibei [5 ]
Wu, Shuang [4 ]
Wang, Zhibo [2 ]
Ren, Kui [2 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Jilin, Peoples R China
[2] Zhejiang Univ, Sch Cyber Sci & Technol, Hangzhou, Peoples R China
[3] Jilin Univ, Publ Comp Educ & Res Ctr, Jilin, Peoples R China
[4] Black Sesame Technol, Solaris, Singapore
[5] Zhejiang Lab, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition has long been a fundamental and intriguing problem in artificial intelligence. The task is challenging due to the high-dimensional nature of an action, as well as the subtle motion details to be considered. Current state-of-the-art approaches typically learn from articulated motion sequences in the straightforward 3D Euclidean space. However, the vanilla Euclidean space is not efficient for modeling important motion characteristics such as the joint-wise angular acceleration, which reveals the driving force behind the motion. Moreover, current methods typically attend to each channel equally and lack theoretical constraints on extracting task-relevant features from the input. In this paper, we tackle these challenges from three aspects: (1) We propose to incorporate an acceleration representation, explicitly modeling the higher-order variations in motion. (2) We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention, where different representations (i.e., streams) supplement each other toward more precise action recognition, while attention capitalizes on the most important channels. (3) We explore feature-level supervision to maximize the extraction of task-relevant information and formulate it as a mutual information loss. Empirically, our approach sets new state-of-the-art performance on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA.
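As an illustration of the first contribution, the sketch below shows one plausible way to derive velocity and acceleration streams from raw skeleton coordinates via finite temporal differences. It is a minimal sketch, not the authors' implementation: the (N, C, T, V) tensor layout, the zero padding, and the function name motion_streams are assumptions, and the differences here are taken on Cartesian coordinates for simplicity, whereas the paper emphasizes joint-wise angular acceleration, which would be computed analogously on joint angles.

```python
import torch
import torch.nn.functional as F

def motion_streams(joints: torch.Tensor):
    """Derive velocity and acceleration streams from skeleton coordinates.

    joints: (N, C, T, V) tensor -- batch, coordinate channels (e.g. x/y/z),
    frames, and body joints -- a layout commonly used by skeleton GCNs.
    Both outputs are zero-padded on the time axis so every stream keeps
    T frames and can be fed to an identical per-stream backbone.
    """
    # First-order temporal difference ~ joint-wise velocity.
    velocity = joints[:, :, 1:, :] - joints[:, :, :-1, :]
    velocity = F.pad(velocity, (0, 0, 0, 1))  # pad one frame at the end of T

    # Second-order temporal difference ~ joint-wise acceleration,
    # the higher-order variation highlighted in the abstract.
    acceleration = velocity[:, :, 1:, :] - velocity[:, :, :-1, :]
    acceleration = F.pad(acceleration, (0, 0, 0, 1))

    return velocity, acceleration

# Illustrative usage: each representation would feed its own GCN stream,
# with per-stream class scores fused, e.g., by a weighted sum.
x = torch.randn(2, 3, 64, 25)  # 2 clips, 3D coordinates, 64 frames, 25 joints
vel, acc = motion_streams(x)
assert vel.shape == acc.shape == x.shape
```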
Pages: 1658-1666
Page count: 9
Related Papers
50 records in total
  • [1] Motion saliency based multi-stream multiplier ResNets for action recognition
    Zong, Ming
    Wang, Ruili
    Chen, Xiubo
    Chen, Zhe
    Gong, Yuanhao
    IMAGE AND VISION COMPUTING, 2021, 107 (107)
  • [2] Research on Grey Modeling for Multi-stream Information
    Liu, Xin
    Dai, Jin
    Zhou, Weijie
    JOURNAL OF GREY SYSTEM, 2016, 28 (04): 127 - 137
  • [3] Multi-Stream Interaction Networks for Human Action Recognition
    Wang, Haoran
    Yu, Baosheng
    Li, Jiaqi
    Zhang, Linlin
    Chen, Dongyue
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (05) : 3050 - 3060
  • [4] Contextual Action Cues from Camera Sensor for Multi-Stream Action Recognition
    Hong, Jongkwang
    Cho, Bora
    Hong, Yong Won
    Byun, Hyeran
    SENSORS, 2019, 19 (06)
  • [5] Multi-stream Global-Local Motion Fusion Network for skeleton-based action recognition
    Qi, Yanpeng
    Pang, Chen
    Liu, Yiliang
    Lyu, Lei
    APPLIED SOFT COMPUTING, 2023, 145
  • [6] Multi-stream network with key frame sampling for human action recognition
    Xia, Limin
    Wen, Xin
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (09): 11958 - 11988
  • [7] Viewpoint guided multi-stream neural network for skeleton action recognition
    He, Yicheng
    Liang, Zixi
    He, Shaocong
    Wang, Yonghua
    Yin, Ming
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 6783 - 6802
  • [8] Skeleton Feature Fusion Based on Multi-Stream LSTM for Action Recognition
    Wang, Lei
    Zhao, Xu
    Liu, Yuncai
    IEEE ACCESS, 2018, 6 : 50788 - 50800
  • [9] Multi-stream asynchrony modeling for audio-visual speech recognition
    Lv, Guoyun
    Jiang, Dongmei
    Zhao, Rongchun
    Hou, Yunshu
    ISM 2007: NINTH IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, PROCEEDINGS, 2007: 37 - 44