Masked cosine similarity prediction for self-supervised skeleton-based action representation learning

Cited by: 0
Authors
Ziliang Ren [1]
Ronggui Liu [2]
Yong Qin [1]
Xiangyang Gao [1]
Qieshi Zhang [2]
Affiliations
[1] Dongguan University of Technology, School of Computer Science and Technology
[2] Chinese Academy of Sciences, CAS Key Laboratory of Human
Keywords
Skeleton-based action recognition; Self-supervised learning; Masked autoencoders
DOI
10.1007/s10044-025-01472-3
Abstract
Skeleton-based human action recognition faces challenges owing to the limited availability of annotated data, which constrains the performance of supervised methods in learning representations of skeleton sequences. To address this issue, researchers have introduced self-supervised learning to reduce the reliance on annotated data; this approach exploits the intrinsic supervisory signals embedded within the data itself. In this study, we demonstrate that considering relative positional relationships between joints, rather than relying on joint coordinates as absolute positional information, yields more effective representations of skeleton sequences. Building on this observation, we introduce the Masked Cosine Similarity Prediction (MCSP) framework, which takes randomly masked skeleton sequences as input and predicts the corresponding cosine similarities between masked joints. Comprehensive experiments show that the proposed MCSP self-supervised pre-training method effectively learns representations of skeleton sequences, improving model performance while decreasing dependence on extensive labeled datasets. After pre-training with MCSP, a vanilla transformer architecture is employed for fine-tuning in action recognition. The results obtained on six subsets of the NTU-RGB+D 60, NTU-RGB+D 120 and PKU-MMD datasets show that our method achieves significant performance improvements on five subsets. Compared to training from scratch, the improvements are 9.8%, 4.9%, 13.0%, 11.5%, and 3.6%, respectively, with top-1 accuracies of 92.9%, 97.3%, 89.8%, 91.2%, and 96.1%. Furthermore, our method achieves comparable results on the PKU-MMD Phase II dataset, reaching a top-1 accuracy of 51.5%. These results are competitive without the need for intricate designs, such as multi-stream model ensembles or extreme data augmentation. The source code of our MCSP is available at https://github.com/skyisyourlimit/MCSP.
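To make the pretext task concrete, the following is a minimal NumPy sketch of the idea described in the abstract: joints in a skeleton sequence are masked at random, and the pretraining target is the pairwise cosine similarity between joint coordinate vectors (a relative-position signal rather than absolute coordinates). The masking ratio, the zero-fill of masked joints, and the exact definition of the target matrix are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skeleton sequence: T frames, J joints, 3D coordinates.
T, J = 4, 25
seq = rng.standard_normal((T, J, 3))

def cosine_similarity_targets(frame):
    """Pairwise cosine similarity between joint coordinate vectors of
    one frame -- encodes relative relationships, not absolute positions."""
    norms = np.linalg.norm(frame, axis=-1, keepdims=True)
    unit = frame / np.clip(norms, 1e-8, None)
    return unit @ unit.T                      # shape (J, J)

mask_ratio = 0.5                              # assumed value
mask = rng.random((T, J)) < mask_ratio        # True = joint is masked
masked_seq = np.where(mask[..., None], 0.0, seq)

# The model would receive masked_seq and be trained to predict the
# cosine-similarity entries that involve masked joints.
targets = np.stack([cosine_similarity_targets(f) for f in seq])  # (T, J, J)
```

In a full implementation, a transformer encoder would consume `masked_seq` and a regression head would predict `targets` restricted to masked positions; the sketch only shows how the supervisory signal can be derived from the data itself.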
Related papers (50 in total)
  • [21] Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning
    Su, Yukun
    Lin, Guosheng
    Sun, Ruizhou
    Hao, Yun
    Wu, Qingyao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 769 - 778
  • [22] Masked Motion Encoding for Self-Supervised Video Representation Learning
    Sun, Xinyu
    Chen, Peihao
    Chen, Liangwei
    Li, Changhao
    Li, Thomas H.
    Tan, Mingkui
    Gan, Chuang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2235 - 2245
  • [23] DIDA: Dynamic Individual-to-integrateD Augmentation for Self-supervised Skeleton-Based Action Recognition
    Hu, Haobo
    Li, Jianan
    Fan, Hongbin
    Zhao, Zhifu
    Zhou, Yangtao
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VII, 2025, 15037 : 496 - 510
  • [24] DMMG: Dual Min-Max Games for Self-Supervised Skeleton-Based Action Recognition
    Guan, Shannan
    Yu, Xin
    Huang, Wei
    Fang, Gengfa
    Lu, Haiyan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 395 - 407
  • [25] Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning
    Chen, Yuxiao
    Zhao, Long
    Yuan, Jianbo
    Tian, Yu
    Xia, Zhaoyang
    Geng, Shijie
    Han, Ligong
    Metaxas, Dimitris N.
    COMPUTER VISION, ECCV 2022, PT XXVI, 2022, 13686 : 185 - 202
  • [26] Bootstrapped Representation Learning for Skeleton-Based Action Recognition
    Moliner, Olivier
    Huang, Sangxia
    Astrom, Kalle
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4153 - 4163
  • [27] Self-supervised Learning for Unintentional Action Prediction
    Zatsarynna, Olga
    Abu Farha, Yazan
    Gall, Juergen
    PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 429 - 444
  • [28] Self-Supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences
    Zhou, Yujie
    Duan, Haodong
    Rao, Anyi
    Su, Bing
    Wang, Jiaqi
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3825 - 3833
  • [29] Self-supervised 3D Skeleton Action Representation Learning with Motion Consistency and Continuity
    Su, Yukun
    Lin, Guosheng
    Wu, Qingyao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13308 - 13318
  • [30] Exploring Set Similarity for Dense Self-supervised Representation Learning
    Wang, Zhaoqing
    Li, Qiang
    Zhang, Guoxin
    Wan, Pengfei
    Zheng, Wen
    Wang, Nannan
    Gong, Mingming
    Liu, Tongliang
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16569 - 16578