Modeling the skeleton-language uncertainty for 3D action recognition

Cited by: 1
Authors
Wang, Mingdao [1 ]
Zhang, Xianlin [2 ]
Chen, Siqi [1 ]
Li, Xueming [2 ]
Zhang, Yue [2 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing 100000, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Digital Media & Design Arts, Beijing, Peoples R China
Keywords
Uncertainty; Multimodal model; 3D skeleton-based action recognition; NETWORKS;
DOI
10.1016/j.neucom.2024.128426
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification numbers
081104; 0812; 0835; 1405
Abstract
Human 3D skeleton-based action recognition has received increasing interest in recent years. Inspired by the strong capabilities of multi-modal models, pioneering attempts employ diverse modalities, i.e., skeleton and language, to construct skeleton-language models and have shown compelling results. Yet, these attempts model the data representation as a deterministic point estimate, ignoring a key issue: descriptions of similar motions are uncertain and ambiguous, which restricts the comprehension of complex concept hierarchies and undermines the reliability of cross-modal alignment. To tackle this challenge, this paper proposes a new Uncertain Skeleton-Language Learning Framework (USLLF) that, for the first time, captures the semantic ambiguity among diverse modalities in a probabilistic manner. USLLF models both inter- and intra-modal uncertainties. Specifically, we first integrate language (text) generated by ChatGPT with a generic skeleton-based network to build a deterministic multi-modal baseline, which can be easily realized with any off-the-shelf skeleton and text encoders. Then, on top of this baseline, we explicitly model the intra-modal (skeleton/language) uncertainties as Gaussian distributions using new uncertainty networks that learn distributional embeddings of the modalities. These embeddings are then aligned and formulated as inter-modal (skeleton-language) uncertainty using both contrastive and negative log-likelihood objectives to alleviate cross-modal alignment error. Experimental results on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets show that our approach outperforms the proposed baseline and achieves performance comparable to state-of-the-art methods with high inference efficiency. We also deliver insightful analyses of how the learned uncertainty reduces the impact of uncertain and ambiguous data on model performance.
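The core idea described in the abstract, replacing a deterministic point embedding with a Gaussian distributional embedding and scoring cross-modal alignment with a negative log-likelihood, can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the `uncertainty_head` projection, the toy feature dimensions, and the single-sample NLL alignment score are all assumptions standing in for the paper's uncertainty networks and training objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def uncertainty_head(feat, w_mu, w_logvar):
    """Map a deterministic encoder feature to Gaussian parameters (mu, log sigma^2).

    Illustrative stand-in for an 'uncertainty network': any off-the-shelf
    skeleton or text encoder output can be projected this way.
    """
    return feat @ w_mu, feat @ w_logvar

def sample_embedding(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_nll(z, mu, log_var):
    """Negative log-likelihood of z under a diagonal Gaussian N(mu, diag(sigma^2))."""
    var = np.exp(log_var)
    return 0.5 * np.sum(log_var + (z - mu) ** 2 / var + np.log(2.0 * np.pi))

# Toy pooled features for one skeleton clip and one text description (dims assumed).
d_in, d_emb = 8, 4
w_mu = rng.standard_normal((d_in, d_emb))
w_lv = 0.01 * rng.standard_normal((d_in, d_emb))
skel_feat = rng.standard_normal(d_in)
text_feat = rng.standard_normal(d_in)

mu_s, lv_s = uncertainty_head(skel_feat, w_mu, w_lv)
mu_t, lv_t = uncertainty_head(text_feat, w_mu, w_lv)

# Cross-modal alignment signal: score a skeleton sample under the text
# distribution; minimizing this NLL pulls the two distributions together.
z_s = sample_embedding(mu_s, lv_s, rng)
align_nll = gaussian_nll(z_s, mu_t, lv_t)
```

In a full pipeline this NLL term would be combined with a contrastive objective over a batch of skeleton-text pairs, as the abstract describes; the sketch only shows the distributional-embedding mechanics.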
Pages: 14
Related papers
50 records in total
  • [1] Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning
    Su, Yukun
    Lin, Guosheng
    Sun, Ruizhou
    Hao, Yun
    Wu, Qingyao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 769 - 778
  • [2] A New Representation of Skeleton Sequences for 3D Action Recognition
    Ke, Qiuhong
    Bennamoun, Mohammed
    An, Senjian
    Sohel, Ferdous
    Boussaid, Farid
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4570 - 4579
  • [3] Action Recognition Based on 3D Skeleton and RGB Frame Fusion
    Liu, Guiyu
    Qian, Jiuchao
    Wen, Fei
    Zhu, Xiaoguang
    Ying, Rendong
    Liu, Peilin
    2019 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2019, : 258 - 264
  • [4] Human Action Recognition Based on Quaternion 3D Skeleton Representation
    Xu Haiyang
    Kong Jun
    Jiang Min
    LASER & OPTOELECTRONICS PROGRESS, 2018, 55 (02)
  • [5] ACTION RECOGNITION USING JOINT COORDINATES OF 3D SKELETON DATA
    Batabyal, Tamal
    Chattopadhyay, Tanushyam
    Mukherjee, Dipti Prasad
    2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015, : 4107 - 4111
  • [6] Infrared and 3D Skeleton Feature Fusion for RGB-D Action Recognition
    De Boissiere, Alban Main
    Noumeir, Rita
IEEE ACCESS, 2020, 8 (08): 168297 - 168308
  • [7] Continuous Sign Language Recognition Based on 3D Hand Skeleton Data
    Wang Z.
    Zhang J.
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2021, 33 (12): 1899 - 1907
  • [8] Hierarchical topic modeling with pose-transition feature for action recognition using 3D skeleton data
    Thien Huynh-The
    Hua, Cam-Hao
    Nguyen Anh Tu
    Hur, Taeho
    Bang, Jaehun
    Kim, Dohyeong
    Amin, Muhammad Bilal
    Kang, Byeong Ho
    Seung, Hyonwoo
    Shin, Soo Yong
    Kim, Eun-Soo
    Lee, Sungyoung
    INFORMATION SCIENCES, 2018, 444 : 20 - 35
  • [9] Arm-hand Action Recognition Based on 3D Skeleton Joints
    Rui, Ling
    Ma, Shi-wei
    Wen, Jia-rui
    Liu, Li-na
    INTERNATIONAL CONFERENCE ON CONTROL AND AUTOMATION (ICCA 2016), 2016, : 326 - 332
  • [10] Deep learning-based action recognition with 3D skeleton: A survey
    Xing, Yuling
    Zhu, Jia
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2021, 6 (01) : 80 - 92