A multi-granular joint tracing transformer for video-based 3D human pose estimation

Cited: 0
Authors
Hou, Yingying [1 ]
Huang, Zhenhua [1 ]
Zhu, Wentao [2 ]
Affiliations
[1] Anhui Univ, Hefei 230601, Anhui, Peoples R China
[2] Amazon Res, Seattle, WA 98101 USA
Funding
National Natural Science Foundation of China
Keywords
3D human pose estimation; Joint-tracing transformer; Temporal dependencies; Spatial relationship;
DOI
10.1007/s11760-024-03589-0
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Human pose estimation from monocular images captured by motion capture cameras is a crucial task with a wide range of downstream applications, e.g., action recognition, motion transfer, and movie making. However, previous methods have not effectively addressed the depth blur problem while considering the temporal correlation of individual and multiple body joints together. We address this issue by simultaneously exploiting temporal information at both single-joint and multiple-joint granularities. Inspired by the observation that different body joints follow different moving trajectories and can be correlated with one another, we propose an approach called the multi-granularity joint tracing transformer (MOTT). MOTT consists of two main components: (1) a spatial transformer that encodes each frame to obtain spatial embeddings of all joints, and (2) a multi-granularity temporal transformer that includes both a holistic temporal transformer, which handles the temporal correlation between all joints in consecutive frames, and a joint tracing temporal transformer, which processes the temporal embedding of each particular joint. The outputs of the two branches are fused to produce accurate 3D human poses. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that MOTT effectively encodes the spatial and temporal dependencies between body joints and outperforms previous methods in terms of mean per joint position error.
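The two-branch data flow described in the abstract can be sketched at the shape level. This is a minimal illustration, not the paper's implementation: the stand-in "transformers" below are simple linear/centering maps, and the layer sizes, token layout, and averaging fusion rule are all assumptions chosen only to make the tensor shapes concrete.

```python
import numpy as np

# Shape-level sketch of the MOTT data flow: spatial encoding per frame,
# a holistic temporal branch over all joints, a joint-tracing temporal
# branch per joint, and a fusion step. All blocks are placeholders.

T, J, C = 9, 17, 32  # frames, body joints, embedding channels (assumed)

rng = np.random.default_rng(0)
x2d = rng.standard_normal((T, J, 2))        # 2D joint inputs per frame

# (1) Spatial transformer: encode each frame's joints into embeddings.
W_spatial = rng.standard_normal((2, C))
emb = x2d @ W_spatial                        # (T, J, C)

# (2a) Holistic temporal branch: one token per frame covering all joints,
# mixed across time (centering stands in for temporal attention).
holistic = emb.reshape(T, J * C)
holistic = holistic - holistic.mean(axis=0)
holistic = holistic.reshape(T, J, C)

# (2b) Joint-tracing temporal branch: each joint's own trajectory
# is processed independently across the T frames.
tracing = np.empty_like(emb)
for j in range(J):
    traj = emb[:, j, :]                      # (T, C) trajectory of joint j
    tracing[:, j, :] = traj - traj.mean(axis=0)

# Fuse the two branches (fusion rule assumed: mean) and regress 3D poses.
fused = 0.5 * (holistic + tracing)
W_head = rng.standard_normal((C, 3))
pose3d = fused @ W_head                      # (T, J, 3) 3D joint positions
print(pose3d.shape)
```

The key structural point the sketch captures is that the holistic branch treats each frame's full joint set as one token, while the tracing branch treats each joint's T-frame trajectory as its own sequence, before the two views are merged.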
Pages: 15