Self-supervised Vision Transformers for 3D pose estimation of novel objects

Cited by: 1
Authors
Thalhammer, Stefan [1]
Weibel, Jean-Baptiste [1]
Vincze, Markus [1]
Garcia-Rodriguez, Jose [2]
Affiliations
[1] TU Wien, Automat & Control Inst, Gusshausstr 27-29, A-1040 Vienna, Austria
[2] Univ Alicante, Dept Comp Technol, Carr San Vicente del Raspeig, Alicante 03690, Spain
Keywords
Object pose estimation; Template matching; Vision transformer; Self-supervised learning
DOI
10.1016/j.imavis.2023.104816
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Object pose estimation is important for object manipulation and scene understanding. To improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the template closest to a query image, which implicitly provides object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art uses CNN-based approaches for novel object pose estimation. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. In detail, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time, such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion and object symmetries, using masked cosine similarity. The presented results demonstrate not only that Vision Transformers improve matching accuracy over CNNs, but also that in some cases pre-trained Vision Transformers do not need fine-tuning to achieve the improvement. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
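The abstract does not spell out how masked cosine similarity is computed, so the following is only a minimal sketch of one plausible reading: per-location feature vectors of a query and a template are compared by cosine similarity, and only locations inside the template's object mask contribute to the score. The function names, the list-of-vectors feature layout, and the averaging over masked locations are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors (eps guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def masked_cosine_similarity(
    query: Sequence[Vector], template: Sequence[Vector], mask: Sequence[bool]
) -> float:
    """Average cosine similarity over locations where the template mask is True.

    Assumption: query and template hold one feature vector per spatial
    location, in the same order, and `mask` marks template foreground.
    """
    scores = [cosine(q, t) for q, t, keep in zip(query, template, mask) if keep]
    return sum(scores) / len(scores) if scores else 0.0


def retrieve_best_template(
    query: Sequence[Vector],
    templates: Sequence[Sequence[Vector]],
    masks: Sequence[Sequence[bool]],
) -> int:
    """Return the index of the template with the highest masked similarity."""
    scores: List[float] = [
        masked_cosine_similarity(query, t, m) for t, m in zip(templates, masks)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy data: three locations with 2-D features each.
# The third query location stands in for background clutter.
query = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
template_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
mask_a = [True, True, False]  # clutter location lies outside the object mask
template_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
mask_b = [True, True, True]

best = retrieve_best_template(query, [template_a, template_b], [mask_a, mask_b])
```

In this toy setup the mask keeps the cluttered third location from penalizing `template_a`, which is the intuition behind masking when queries contain clutter or occlusion. In a template-matching pipeline the index of the retrieved template would then stand for an object class and pose, as the abstract describes.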
Pages: 9