Masked cosine similarity prediction for self-supervised skeleton-based action representation learning

Cited by: 0
Authors
Ziliang Ren [1 ]
Ronggui Liu [2 ]
Yong Qin [1 ]
Xiangyang Gao [1 ]
Qieshi Zhang [2 ]
Institutions
[1] Dongguan University of Technology,School of Computer Science and Technology
[2] Chinese Academy of Sciences,CAS Key Laboratory of Human
Keywords
Skeleton-based action recognition; Self-supervised learning; Masked autoencoders;
DOI
10.1007/s10044-025-01472-3
Abstract
Skeleton-based human action recognition faces challenges owing to the limited availability of annotated data, which constrains the performance of supervised methods in learning representations of skeleton sequences. To address this issue, researchers have introduced self-supervised learning as a means of reducing the reliance on annotated data. This approach exploits the intrinsic supervisory signals embedded within the data itself. In this study, we demonstrate that considering relative positional relationships between joints, rather than relying on joint coordinates as absolute positional information, yields more effective representations of skeleton sequences. Based on this, we introduce the Masked Cosine Similarity Prediction (MCSP) framework, which takes randomly masked skeleton sequences as input and predicts the corresponding cosine similarity between masked joints. Comprehensive experiments show that the proposed MCSP self-supervised pre-training method effectively learns representations of skeleton sequences, improving model performance while decreasing dependence on extensive labeled datasets. After pre-training with MCSP, a vanilla transformer architecture is employed for fine-tuning in action recognition. The results obtained from six subsets of the NTU-RGB+D 60, NTU-RGB+D 120 and PKU-MMD datasets show that our method achieves significant performance improvements on five subsets. Compared to training from scratch, performance improvements are 9.8%, 4.9%, 13.0%, 11.5%, and 3.6%, respectively, with top-1 accuracies of 92.9%, 97.3%, 89.8%, 91.2%, and 96.1% being achieved. Furthermore, our method achieves comparable results on the PKU-MMD Phase II dataset, with a top-1 accuracy of 51.5%. These results are competitive without the need for intricate designs, such as multi-stream model ensembles or extreme data augmentation. The source code of our MCSP is available at https://github.com/skyisyourlimit/MCSP.
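The pretext task described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine_similarity_targets`, `mcsp_loss`) and the choice of a mean-squared-error objective restricted to masked joints are assumptions for illustration only; the paper's exact masking scheme and loss may differ.

```python
import numpy as np

def cosine_similarity_targets(joints, eps=1e-8):
    # joints: (J, 3) array of 3D joint coordinates for one frame.
    # Returns a (J, J) matrix of pairwise cosine similarities — a
    # relative-position target of the kind MCSP predicts for masked joints.
    norms = np.linalg.norm(joints, axis=1, keepdims=True)
    unit = joints / np.maximum(norms, eps)   # normalize each joint vector
    return unit @ unit.T

def mcsp_loss(pred_sim, target_sim, mask):
    # Hypothetical objective: mean squared error over rows of the
    # similarity matrix that correspond to masked joints.
    m = mask.astype(bool)
    diff = pred_sim[m] - target_sim[m]
    return float(np.mean(diff ** 2))

# Toy example: 4 joints, 2 of them masked.
rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 3))
target = cosine_similarity_targets(joints)
mask = np.array([1, 0, 1, 0])
pred = np.zeros_like(target)   # a trivial stand-in "prediction"
loss = mcsp_loss(pred, target, mask)
```

Because the targets are cosine similarities rather than raw coordinates, they are invariant to the global scale of the skeleton, which is one plausible reason relative-position targets transfer better than absolute joint positions.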