Masked cosine similarity prediction for self-supervised skeleton-based action representation learning

Cited by: 0
Authors
Ziliang Ren [1 ]
Ronggui Liu [2 ]
Yong Qin [1 ]
Xiangyang Gao [1 ]
Qieshi Zhang [2 ]
Affiliations
[1] Dongguan University of Technology,School of Computer Science and Technology
[2] Chinese Academy of Sciences,CAS Key Laboratory of Human
Keywords
Skeleton-based action recognition; Self-supervised learning; Masked autoencoders;
DOI
10.1007/s10044-025-01472-3
Abstract
Skeleton-based human action recognition faces challenges owing to the limited availability of annotated data, which constrains the performance of supervised methods in learning representations of skeleton sequences. To address this issue, researchers have introduced self-supervised learning as a method of reducing the reliance on annotated data. This approach exploits the intrinsic supervisory signals embedded within the data itself. In this study, we demonstrate that considering relative positional relationships between joints, rather than relying on joint coordinates as absolute positional information, yields more effective representations of skeleton sequences. Based on this, we introduce the Masked Cosine Similarity Prediction (MCSP) framework, which takes randomly masked skeleton sequences as input and predicts the corresponding cosine similarity between masked joints. Comprehensive experiments show that the proposed MCSP self-supervised pre-training method effectively learns representations of skeleton sequences, improving model performance while decreasing dependence on extensive labeled datasets. After pre-training with MCSP, a vanilla transformer architecture is employed for fine-tuning in action recognition. The results obtained from six subsets of the NTU-RGB+D 60, NTU-RGB+D 120 and PKU-MMD datasets show that our method achieves significant performance improvements on five subsets. Compared to training from scratch, performance improvements are 9.8%, 4.9%, 13%, 11.5%, and 3.6%, respectively, with top-1 accuracies of 92.9%, 97.3%, 89.8%, 91.2%, and 96.1% being achieved. Furthermore, our method achieves comparable results on the PKU-MMD Phase II dataset, achieving a top-1 accuracy of 51.5%. These results are competitive without the need for intricate designs, such as multi-stream model ensembles or extreme data augmentation. The source code of our MCSP is available at https://github.com/skyisyourlimit/MCSP.
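The abstract describes the pretext task only at a high level, so the following is a minimal sketch, not the authors' implementation: it assumes the prediction target for each masked joint is its per-frame cosine similarity to the other joints' 3D coordinate vectors. Function names, the masking ratio, and the random-masking scheme are all illustrative assumptions.

```python
import numpy as np

def cosine_similarity_targets(seq, eps=1e-8):
    """Per-frame pairwise cosine similarity between joint coordinate
    vectors. seq: (T, J, 3) array of 3D joint positions."""
    # Normalize each joint's coordinate vector to unit length
    unit = seq / np.clip(np.linalg.norm(seq, axis=-1, keepdims=True), eps, None)
    # (T, J, J): cosine similarity of every joint pair within each frame
    return np.einsum("tjc,tkc->tjk", unit, unit)

def random_joint_mask(T, J, ratio=0.4, seed=0):
    """Boolean (T, J) mask; True marks joints hidden from the encoder,
    whose similarity rows the model would be trained to predict."""
    rng = np.random.default_rng(seed)
    return rng.random((T, J)) < ratio

# Toy usage: 4 frames, 5 joints
seq = np.random.default_rng(1).normal(size=(4, 5, 3))
targets = cosine_similarity_targets(seq)  # (4, 5, 5), symmetric, unit diagonal
mask = random_joint_mask(4, 5)            # joint positions to mask out
```

Under this reading, pre-training would regress the model's output at masked positions against the corresponding rows of `targets`, which encode relative (direction-only) rather than absolute positional information.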
Related papers
50 items in total
  • [41] Towards Latent Masked Image Modeling for Self-supervised Visual Representation Learning
    Wei, Yibing
    Gupta, Abhinav
    Morgado, Pedro
    COMPUTER VISION - ECCV 2024, PT XXXIX, 2025, 15097 : 1 - 17
  • [42] Cross-View Masked Model for Self-Supervised Graph Representation Learning
    Duan H.
    Yu B.
    Xie C.
    IEEE Transactions on Artificial Intelligence, 2024, 5 (11): 1 - 13
  • [43] Masked self-supervised ECG representation learning via multiview information bottleneck
    Yang, Shunxiang
    Lian, Cheng
    Zeng, Zhigang
    Xu, Bingrong
    Su, Yixin
    Xue, Chenyang
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (14): 7625 - 7637
  • [44] Collaboratively Self-Supervised Video Representation Learning for Action Recognition
    Zhang, Jie
    Wan, Zhifan
    Hu, Lanqing
    Lin, Stephen
    Wu, Shuzhe
    Shan, Shiguang
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 1895 - 1907
  • [46] MST: Masked Self-Supervised Transformer for Visual Representation
    Li, Zhaowen
    Chen, Zhiyang
    Yang, Fan
    Li, Wei
    Zhu, Yousong
    Zhao, Chaoyang
    Deng, Rui
    Wu, Liwei
    Zhao, Rui
    Tang, Ming
    Wang, Jinqiao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [47] Self-Supervised Representation Learning via Latent Graph Prediction
    Xie, Yaochen
    Xu, Zhao
    Ji, Shuiwang
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [48] Learning Latent Global Network for Skeleton-Based Action Prediction
    Ke, Qiuhong
    Bennamoun, Mohammed
    Rahmani, Hossein
    An, Senjian
    Sohel, Ferdous
    Boussaid, Farid
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 959 - 970
  • [49] Self-Supervised Learning for Multilevel Skeleton-Based Forgery Detection via Temporal-Causal Consistency of Actions
    Hu, Liang
    Liu, Dora D.
    Zhang, Qi
    Naseem, Usman
    Lai, Zhong Yuan
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 844 - 853
  • [50] Skeleton-based Self-Supervised Feature Extraction for Improved Dynamic Hand Gesture Recognition
    Ikne, Omar
    Allaert, Benjamin
    Wannous, Hazem
    2024 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, FG 2024, 2024,