Modeling the skeleton-language uncertainty for 3D action recognition

Cited by: 1
Authors
Wang, Mingdao [1]
Zhang, Xianlin [2]
Chen, Siqi [1]
Li, Xueming [2]
Zhang, Yue [2]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing 100000, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Digital Media & Design Arts, Beijing, Peoples R China
Keywords
Uncertainty; Multimodal model; 3D skeleton-based action recognition; NETWORKS;
DOI
10.1016/j.neucom.2024.128426
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Human 3D skeleton-based action recognition has received increasing interest in recent years. Inspired by the strong capabilities of multi-modal models, some pioneering attempts employ diverse modalities, i.e., skeleton and language, to construct skeleton-language models and have shown compelling results. Yet these attempts model the data representation as a deterministic point estimate, ignoring a key issue: descriptions of similar motions are uncertain and ambiguous, which restricts comprehension of complex concept hierarchies and weakens the reliability of cross-modal alignment. To tackle this challenge, this paper proposes a new Uncertain Skeleton-Language Learning Framework (USLLF) that, for the first time, captures the semantic ambiguity among diverse modalities in a probabilistic manner. USLLF models both inter- and intra-modal uncertainties. Specifically, we first integrate the language (text) generated by ChatGPT with a generic skeleton-based network to develop a deterministic multi-modal baseline, which can be built from any off-the-shelf skeleton and text encoders. Then, based on this baseline, we explicitly model the intra-modal (skeleton/language) uncertainties as Gaussian distributions using new uncertainty networks capable of learning distributional embeddings of the modalities. Following this, these embeddings are aligned and formulated as inter-modal (skeleton-language) uncertainty using both contrastive and negative log-likelihood objectives to alleviate the cross-modal alignment error. Experimental results on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets show that our approach outperforms the proposed baseline and achieves performance comparable to state-of-the-art methods with high inference efficiency. We also provide analyses of how learned uncertainty reduces the impact of uncertain and ambiguous data on model performance.
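The abstract describes Gaussian embeddings per modality, aligned with contrastive and negative log-likelihood objectives, but gives no formulas. The NumPy sketch below is only an illustration of that general idea, not the authors' USLLF implementation. Everything in it is an assumption: the linear heads, the diagonal covariance, the random stand-in features, the InfoNCE form of the contrastive term, the choice to score the text mean under the skeleton distribution, and the unweighted sum of the two losses.

```python
import numpy as np


def gaussian_head(features, w_mu, w_logvar):
    """Map deterministic features to a diagonal Gaussian (mean, log-variance)."""
    return features @ w_mu, features @ w_logvar


def gaussian_nll(target, mu, logvar):
    """Mean negative log-likelihood of target under N(mu, diag(exp(logvar)))."""
    per_dim = 0.5 * (logvar + (target - mu) ** 2 / np.exp(logvar) + np.log(2.0 * np.pi))
    return per_dim.sum(axis=1).mean()


def info_nce(a, b, temperature=0.1):
    """Symmetric contrastive loss; row i of a is the positive pair of row i of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


rng = np.random.default_rng(0)
batch, feat_dim, emb_dim = 4, 16, 8

# Random stand-ins for the outputs of off-the-shelf skeleton and text encoders.
skel_feat = rng.normal(size=(batch, feat_dim))
text_feat = rng.normal(size=(batch, feat_dim))

# Hypothetical linear "uncertainty heads", one pair per modality.
weights = {
    name: rng.normal(scale=0.1, size=(feat_dim, emb_dim))
    for name in ("skel_mu", "skel_logvar", "text_mu", "text_logvar")
}
s_mu, s_logvar = gaussian_head(skel_feat, weights["skel_mu"], weights["skel_logvar"])
t_mu, t_logvar = gaussian_head(text_feat, weights["text_mu"], weights["text_logvar"])

# Reparameterized samples from each intra-modal distribution.
s_sample = s_mu + np.exp(0.5 * s_logvar) * rng.normal(size=s_mu.shape)
t_sample = t_mu + np.exp(0.5 * t_logvar) * rng.normal(size=t_mu.shape)

# Inter-modal alignment: contrastive term on samples plus an NLL term.
contrastive = info_nce(s_sample, t_sample)
nll = gaussian_nll(t_mu, s_mu, s_logvar)
total_loss = contrastive + nll  # equal weighting is an arbitrary choice here

print("contrastive:", contrastive)
print("nll:", nll)
print("total:", total_loss)
```

In a trained model the heads would be learned jointly with the encoders, and a larger predicted variance would down-weight the squared-error part of the NLL for ambiguous samples, which is one common reading of how learned uncertainty can reduce the influence of noisy pairs.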
Pages: 14
Related papers
50 in total
  • [31] Human skeleton representation for 3D action recognition based on complex network coding and LSTM
    Shen, Xiangpei
    Ding, Yanrui
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2022, 82
  • [32] A three-stream fusion network for 3D skeleton-based action recognition
    Fang, Ming
    Liu, Qi
    Ren, Jianping
    Li, Jie
    Du, Xinning
    Liu, Shuhua
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [33] Fuzzy Integral-Based CNN Classifier Fusion for 3D Skeleton Action Recognition
    Banerjee, Avinandan
    Singh, Pawan Kumar
    Sarkar, Ram
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (06) : 2206 - 2216
  • [34] Deep Learning-Based Action Recognition Using 3D Skeleton Joints Information
    Tasnim, Nusrat
    Islam, Md. Mahbubul
    Baek, Joong-Hwan
    INVENTIONS, 2020, 5 (03) : 1 - 15
  • [35] Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints
    Caetano, Carlos
    Bremond, Francois
    Schwartz, William Robson
    2019 32ND SIBGRAPI CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 2019, : 16 - 23
  • [36] Rethinking the ST-GCNs for 3D skeleton-based human action recognition
    Peng, Wei
    Shi, Jingang
    Varanka, Tuomas
    Zhao, Guoying
    NEUROCOMPUTING, 2021, 454 : 45 - 53
  • [37] 3D TRAJECTORIES FOR ACTION RECOGNITION
    Koperski, Michal
    Bilinski, Piotr
    Bremond, Francois
    2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2014, : 4176 - 4180
  • [38] Language-guided temporal primitive modeling for skeleton-based action recognition
    Pan, Qingzhe
    Xie, Xuemei
    NEUROCOMPUTING, 2025, 613
  • [39] SKELETON-BASED MODELING OF 3D SURFACES
    Jankauskas, Kestutis
    Noreika, Algirdas
    INFORMATION TECHNOLOGIES' 2009, 2009, : 235 - 242
  • [40] Hankelet-based dynamical systems modeling for 3D action recognition
    Lo Presti, Liliana
    La Cascia, Marco
    Sclaroff, Stan
    Camps, Octavia
    IMAGE AND VISION COMPUTING, 2015, 44 : 29 - 43