Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition

被引:15
|
作者
Xin, Wentian [1 ]
Miao, Qiguang [1 ]
Liu, Yi [1 ]
Liu, Ruyi [1 ]
Pun, Chi-Man [2 ]
Shi, Cheng [3 ]
机构
[1] Xidian Univ, Xian, Peoples R China
[2] Univ Macau, Macau, Peoples R China
[3] Xian Univ Technol, Xian, Peoples R China
基金
国家重点研发计划;
关键词
video understanding; skeleton action recognition; topology representation; transformer; attention;
D O I
10.1145/3581783.3611900
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Vision Transformer, which performs well in various vision tasks, encounters a bottleneck in skeleton-based action recognition and falls short of advanced GCN-based methods. The root cause is that the current skeleton transformer depends on the self-attention mechanism of the complete channel of the global joint, ignoring the highly discriminative differential correlation within the channel, so it is challenging to learn the expression of the multivariate topology dynamically. To tackle this, we present Skeleton MixFormer, an innovative spatio-temporal architecture to effectively represent the physical correlations and temporal interactivity of the compact skeleton data. Two essential components make up the proposed framework: 1) Spatial MixFormer. The channel-grouping and mix-attention are utilized to calculate the dynamic multivariate topological relationships. Compared with the full-channel self-attention method, Spatial MixFormer better highlights the channel groups' discriminative differences and the joint adjacency's interpretable learning. 2) Temporal MixFormer, which consists of Multiscale Convolution, Temporal Transformer and Sequential Holding Module. The multivariate temporal models ensure the richness of global difference expression and realize the discrimination of crucial intervals in the sequence, thereby enabling more effective learning of long and short-term dependencies in actions. Our Skeleton Mix-Former demonstrates state-of-the-art (SOTA) performance across seven different settings on four standard datasets, namely NTU-60, NTU-120, NW-UCLA, and UAV-Human. Related code will be available on Skeleton-MixFormer.
引用
收藏
页码:2211 / 2220
页数:10
相关论文
共 50 条
  • [1] Bootstrapped Representation Learning for Skeleton-Based Action Recognition
    Moliner, Olivier
    Huang, Sangxia
    Astrom, Kalle
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4153 - 4163
  • [2] BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition
    Zhou, Yuxuan
    Yan, Xudong
    Cheng, Zhi-Qi
    Yan, Yan
    Dai, Qi
    Hua, Xian-Sheng
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 2049 - 2058
  • [3] InfoGCN: Representation Learning for Human Skeleton-based Action Recognition
    Chi, Hyung-gun
    Ha, Myoung Hoon
    Chi, Seunggeun
    Lee, Sang Wan
    Huang, Qixing
    Ramani, Karthik
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 20154 - 20164
  • [4] A High Invariance Motion Representation for Skeleton-Based Action Recognition
    Guo, Songrui
    Pan, Huawei
    Tan, Guanghua
    Chen, Lin
    Gao, Chunming
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2016, 30 (08)
  • [5] Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition
    Lin, Lilang
    Wu, Lehong
    Zhang, Jiahang
    Wang, Jiaying
    COMPUTER VISION - ECCV 2024, PT XXVI, 2025, 15084 : 75 - 92
  • [6] Representation Learning of Temporal Dynamics for Skeleton-Based Action Recognition
    Du, Yong
    Fu, Yun
    Wang, Liang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (07) : 3010 - 3022
  • [7] Improved Graph Convolutional Network with Enriched Graph Topology Representation for Skeleton-Based Action Recognition
    Alsarhan, Tamam
    Harfoushi, Osama
    Shdefat, Ahmed Younes
    Mostafa, Nour
    Alshinwan, Mohammad
    Ali, Ahmad
    ELECTRONICS, 2023, 12 (04)
  • [8] Revisiting Skeleton-based Action Recognition
    Duan, Haodong
    Zhao, Yue
    Chen, Kai
    Lin, Dahua
    Dai, Bo
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 2959 - 2968
  • [9] Adaptive Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition
    Yu, Jiahui
    Gao, Hongwei
    Chen, Yongquan
    Zhou, Dalin
    Liu, Jinguo
    Ju, Zhaojie
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (04) : 1654 - 1665
  • [10] A Novel Skeleton Spatial Pyramid Model for Skeleton-based Action Recognition
    Li, Yanshan
    Guo, Tianyu
    Xia, Rongjie
    Liu, Xing
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING (ICSIP 2019), 2019, : 16 - 20