Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

Cited by: 1
Authors
Francani, Andre O. [1 ]
Maximo, Marcos R. O. A. [1 ]
Affiliations
[1] Aeronaut Inst Technol, Autonomous Computat Syst Lab, BR-12228900 Sao Jose Dos Campos, SP, Brazil
Source
IEEE ACCESS, 2025, Vol. 13
Keywords
Transformers; Visual odometry; Feature extraction; Deep learning; Computer architecture; 6-DOF; Pipelines; Odometry; Vectors; Context modeling; monocular visual odometry; transformer; video understanding;
DOI
10.1109/ACCESS.2025.3531667
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Classification Code
0812;
Abstract
Estimating a camera's pose from the images of a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry, and it often relies on geometric approaches that require considerable engineering effort for each specific scenario. Deep learning methods have been shown to generalize well after proper training on a large amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and in computer vision tasks such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task and estimate the 6 degrees of freedom of the camera's pose. We contribute the TSformer-VO model, based on spatio-temporal self-attention mechanisms, to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieves performance competitive with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the widely adopted DeepVO implementation. The code is publicly available at https://github.com/aofrancani/TSformer-VO.
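The abstract describes a model that applies spatio-temporal self-attention to short clips and regresses relative 6-DoF motions end-to-end. The sketch below is a minimal, hypothetical PyTorch illustration of that idea (divided space-time attention over patch tokens of a clip, followed by a linear pose head). The class names, dimensions, depth, and the use of Euler angles for rotation are illustrative assumptions and do not reproduce the published TSformer-VO configuration; see the repository linked above for the authors' implementation.

```python
# Hypothetical sketch: divided space-time attention over a clip, regressing 6-DoF motions.
# All hyperparameters are illustrative assumptions, not the TSformer-VO configuration.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One encoder block: temporal self-attention, then spatial self-attention, then MLP."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, T: int, N: int) -> torch.Tensor:
        B, _, D = x.shape  # x: (B, T*N, D) tokens for T frames with N patches each
        # Temporal attention: each spatial location attends across the T frames.
        xt = x.view(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        y = self.norm_t(xt)
        xt = xt + self.attn_t(y, y, y, need_weights=False)[0]
        x = xt.view(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)
        # Spatial attention: the patches of each frame attend to each other.
        xs = x.reshape(B * T, N, D)
        y = self.norm_s(xs)
        xs = xs + self.attn_s(y, y, y, need_weights=False)[0]
        x = xs.reshape(B, T * N, D)
        return x + self.mlp(self.norm_m(x))


class ClipToPose(nn.Module):
    """Embeds a clip into patch tokens and regresses (T-1) relative 6-DoF motions."""

    def __init__(self, img_size=192, patch=16, frames=3, dim=256, depth=4, heads=8):
        super().__init__()
        self.patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, frames * self.patches, dim))
        self.blocks = nn.ModuleList([DividedSpaceTimeBlock(dim, heads) for _ in range(depth)])
        self.head = nn.Linear(dim, 6 * (frames - 1))  # 3 translations + 3 Euler angles per motion

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = clip.shape  # clip: (B, T, 3, H, W)
        tokens = self.embed(clip.flatten(0, 1)).flatten(2).transpose(1, 2)  # (B*T, N, dim)
        x = tokens.reshape(B, T * self.patches, -1) + self.pos
        for blk in self.blocks:
            x = blk(x, T, self.patches)
        return self.head(x.mean(dim=1)).view(B, T - 1, 6)  # (B, T-1, 6) relative motions


if __name__ == "__main__":
    poses = ClipToPose()(torch.randn(2, 3, 3, 192, 192))
    print(poses.shape)  # torch.Size([2, 2, 6])
```

Factoring the attention into separate temporal and spatial passes (as in TimeSformer-style video transformers) keeps the cost linear in T*N per pass instead of quadratic in the full token count, which is the usual motivation for this design in video understanding models.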
Pages: 13959-13971
Number of pages: 13
Related Papers (showing 10 of 50)
  • [1] Transformer-Based Self-Supervised Monocular Depth and Visual Odometry
    Zhao, Hongru
    Qiao, Xiuquan
    Ma, Yi
    Tafazolli, Rahim
    IEEE SENSORS JOURNAL, 2023, 23 (02) : 1436 - 1446
  • [2] SWformer-VO: A Monocular Visual Odometry Model Based on Swin Transformer
    Wu, Zhigang
    Zhu, Yaohui
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (05) : 4766 - 4773
  • [3] A Neural ODE and Transformer-based Model for Temporal Understanding and Dense Video Captioning
    Artham, Sainithin
    Shaikh, Soharab Hossain
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (23) : 64037 - 64056
  • [4] TRANSFORMER-BASED APPROACH FOR DOCUMENT LAYOUT UNDERSTANDING
    Yang, Huichen
    Hsu, William
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 4043 - 4047
  • [5] Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry
    Francani, Andre O.
    Maximo, Marcos R. O. A.
    2022 LATIN AMERICAN ROBOTICS SYMPOSIUM (LARS), 2022 BRAZILIAN SYMPOSIUM ON ROBOTICS (SBR), AND 2022 WORKSHOP ON ROBOTICS IN EDUCATION (WRE), 2022, : 312 - 317
  • [6] Vision transformer-based visual language understanding of the construction process
    Yang, Bin
    Zhang, Binghan
    Han, Yilong
    Liu, Boda
    Hu, Jiniming
    Jin, Yiming
    ALEXANDRIA ENGINEERING JOURNAL, 2024, 99 : 242 - 256
  • [7] Monocular Visual Odometry Based on Hybrid Parameterization
    Mohamed, Sherif A. S.
    Haghbayan, Mohammad-Hashem
    Heikkonen, Jukka
    Tenhunen, Hannu
    Plosila, Juha
    TWELFTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2019), 2020, 11433
  • [8] From Local Understanding to Global Regression in Monocular Visual Odometry
    Esfahani, Mahdi Abolfazli
    Wu, Keyu
    Yuan, Shenghai
    Wang, Han
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2020, 34 (01)
  • [9] A Novel Approach to Improve the Precision of Monocular Visual Odometry
    Xiao, Chen
    Zhu, Xiaorui
    Feng, Wei
    Ou, Yongsheng
    2015 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, 2015, : 392 - 397
  • [10] LVBERT: Transformer-Based Model for Latvian Language Understanding
    Znotins, Arturs
    Barzdins, Guntis
    HUMAN LANGUAGE TECHNOLOGIES - THE BALTIC PERSPECTIVE (HLT 2020), 2020, 328 : 111 - 115