Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

Cited by: 1
Authors
Francani, Andre O. [1]
Maximo, Marcos R. O. A. [1]
Affiliations
[1] Aeronaut Inst Technol, Autonomous Computat Syst Lab, BR-12228900 Sao Jose Dos Campos, SP, Brazil
Source
IEEE ACCESS | 2025 / Vol. 13
Keywords
Transformers; Visual odometry; Feature extraction; Deep learning; Computer architecture; 6-DOF; Pipelines; Odometry; Vectors; Context modeling; monocular visual odometry; transformer; video understanding
DOI
10.1109/ACCESS.2025.3531667
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Estimating a camera's pose from the images of a single camera is a traditional task in mobile robotics and autonomous vehicles. This problem, called monocular visual odometry, often relies on geometric approaches that require considerable engineering effort for each specific scenario. Deep learning methods have been shown to generalize well after proper training on a large amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and in computer vision tasks such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task and estimate the 6 degrees of freedom of the camera's pose. We contribute the TSformer-VO model, which uses spatio-temporal self-attention mechanisms to extract features from clips and estimate motions in an end-to-end manner. Our approach achieved performance competitive with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the widely accepted DeepVO implementation. The code is publicly available at https://github.com/aofrancani/TSformer-VO.
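The abstract describes extracting features from video clips with spatio-temporal self-attention and regressing 6-DoF inter-frame motions end-to-end. As a rough illustration of that idea (not the authors' actual TSformer-VO architecture — see their repository for that), the following minimal NumPy sketch applies divided space-time attention, TimeSformer-style: temporal attention across frames at each patch position, then spatial attention across patches within each frame, followed by a pooling step and a linear head producing one 6-DoF motion vector per frame pair. All shapes, the single untrained head, and the omission of projection weights, multi-head structure, and positional embeddings are simplifying assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the first axis.
    # Query/key/value projections are omitted for brevity (illustrative only).
    d = x.shape[-1]
    w = softmax(x @ x.T / np.sqrt(d))
    return w @ x

def tsformer_vo_sketch(clip_tokens):
    """clip_tokens: (T, N, d) — T frames, N patch tokens per frame, d channels."""
    T, N, d = clip_tokens.shape
    # Temporal attention: each patch position attends across the T frames.
    tmp = np.stack([self_attention(clip_tokens[:, n]) for n in range(N)], axis=1)
    # Spatial attention: within each frame, patches attend to each other.
    out = np.stack([self_attention(tmp[t]) for t in range(T)], axis=0)
    # Pool patch tokens per frame, then regress T-1 relative 6-DoF motions
    # (3 translation + 3 rotation components) with an untrained linear head.
    pooled = out.mean(axis=1)            # (T, d)
    head = np.zeros((d, 6))              # placeholder weights (would be learned)
    return (pooled[1:] - pooled[:-1]) @ head  # (T-1, 6)

poses = tsformer_vo_sketch(np.random.randn(3, 16, 32))
print(poses.shape)  # (2, 6): one 6-DoF motion per consecutive frame pair
```

A clip of T frames yields T-1 relative motions, matching the paper's framing of odometry as regressing camera motion between consecutive frames of a clip.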
Pages: 13959 - 13971
Page count: 13