Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

Cited by: 0
Authors
Xiao, Baihui [1 ]
Xu, Jingzehua [1 ]
Zhang, Zekai [1 ]
Xing, Tianyu [1 ]
Wang, Jingjing [2 ]
Ren, Yong [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT II | 2024, Vol. 15017
Funding
National Natural Science Foundation of China
Keywords
Frame camera; Event camera; Multi-modal fusion; Transformer self-attention; Monocular depth estimation; Vision
DOI
10.1007/978-3-031-72335-3_29
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Frame cameras struggle to estimate depth maps accurately under abnormal lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, capture sparse, asynchronous event streams that record per-pixel brightness changes, addressing the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders the ability of event cameras to predict dense depth maps effectively. Integrating event streams with frame data can significantly improve monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that fuses event and frame data using a transformer-based model. The proposed framework comprises two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention to model the interactions between frame patches and event tensors, capturing dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module, which extracts contextual information and delivers detailed depth predictions. Experimental results on the EventScape and MVSEC datasets confirm that our method sets a new performance benchmark.
Pages: 419-433
Page count: 15
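
The abstract describes a multimodal encoder that applies joint self-attention across frame patches and event tensors. As a rough, hypothetical sketch of that fusion idea (not the authors' implementation: the voxel-grid event representation, the module name, and all dimensions below are assumptions made for illustration), a minimal PyTorch version could look like:

# Sketch of joint self-attention over frame and event tokens.
# Illustrates the fusion idea from the abstract; NOT the paper's
# architecture. Event input is assumed to be a voxel grid with
# `event_bins` temporal channels.
import torch
import torch.nn as nn

class EventFrameFusionEncoder(nn.Module):
    def __init__(self, dim=128, patch=16, event_bins=5, heads=4, layers=2):
        super().__init__()
        # Patch embeddings: frames are 3-channel images; events are
        # voxelized into `event_bins` channels (an assumed representation).
        self.frame_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.event_embed = nn.Conv2d(event_bins, dim, kernel_size=patch, stride=patch)
        # Learned modality embeddings so attention can tell tokens apart.
        self.frame_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.event_type = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, frame, events):
        # frame: (B, 3, H, W); events: (B, event_bins, H, W)
        f = self.frame_embed(frame).flatten(2).transpose(1, 2)   # (B, Nf, dim)
        e = self.event_embed(events).flatten(2).transpose(1, 2)  # (B, Ne, dim)
        tokens = torch.cat([f + self.frame_type, e + self.event_type], dim=1)
        # Joint self-attention: every frame patch can attend to every
        # event token and vice versa (cross-modal fusion).
        return self.encoder(tokens)

if __name__ == "__main__":
    model = EventFrameFusionEncoder()
    fused = model(torch.randn(1, 3, 256, 256), torch.randn(1, 5, 256, 256))
    print(fused.shape)  # (1, 512, 128): 256 frame + 256 event tokens

Concatenating both token sets before self-attention is one common way to realize the cross-modal dependency modeling the abstract describes; the paper's multi-scale fusion and dual-phase, triple-scale decoder are not reproduced here.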
相关论文
共 50 条
  • [31] Event-Based Monocular Depth Estimation With Recurrent Transformers
    Liu, Xu
    Li, Jianing
    Shi, Jinqiao
    Fan, Xiaopeng
    Tian, Yonghong
    Zhao, Debin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 7417 - 7429
  • [32] Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
    Wang, Xiang
    Luo, Haonan
    Wang, Zihang
    Zheng, Jin
    Bai, Xiao
    INFORMATION FUSION, 2024, 108
  • [33] Monocular Depth Estimation Based on Multi-Scale Depth Map Fusion
    Yang, Xin
    Chang, Qingling
    Liu, Xinglin
    He, Siyuan
    Cui, Yan
    IEEE ACCESS, 2021, 9 : 67696 - 67705
  • [34] Novel Hybrid Neural Network for Dense Depth Estimation using On-Board Monocular Images
    Jia, Shaocheng
    Pei, Xin
    Yang, Zi
    Tian, Shan
    Yue, Yun
    TRANSPORTATION RESEARCH RECORD, 2020, 2674 (12) : 312 - 323
  • [35] Depth estimation of monocular video using non-parametric fusion of multiple cues
    Liu, Tianliang
    Mo, Yiming
    Xu, Gaobang
    Dai, Xiubin
    Zhu, Xiuchang
    Luo, Jiebo
    Dongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Southeast University (Natural Science Edition), 2015, 45 (05): : 834 - 839
  • [36] Lightweight Monocular Depth Estimation via Token-Sharing Transformer
    Lee, Dong-Jae
    Lee, Jae Young
    Shon, Hyunguk
    Yi, Eojindl
    Park, Yeong-Hun
    Cho, Sung-Sik
    Kim, Junmo
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA, 2023, : 4895 - 4901
  • [37] Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy
    Liu, Xingtong
    Sinha, Ayushi
    Unberath, Mathias
    Ishii, Masaru
    Hager, Gregory D.
    Taylor, Russell H.
    Reiter, Austin
    OR 2.0 CONTEXT-AWARE OPERATING THEATERS, COMPUTER ASSISTED ROBOTIC ENDOSCOPY, CLINICAL IMAGE-BASED PROCEDURES, AND SKIN IMAGE ANALYSIS, OR 2.0 2018, 2018, 11041 : 128 - 138
  • [38] MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Zhao, Chaoqiang
    Zhang, Youmin
    Poggi, Matteo
    Tosi, Fabio
    Guo, Xianda
    Zhu, Zheng
    Huang, Guan
    Tang, Yang
    Mattoccia, Stefano
    2022 INTERNATIONAL CONFERENCE ON 3D VISION, 3DV, 2022, : 668 - 678
  • [39] Underwater Monocular Depth Estimation Based on Physical-Guided Transformer
    Wang, Chen
    Xu, Haiyong
    Jiang, Gangyi
    Yu, Mei
    Luo, Ting
    Chen, Yeyao
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 18 - 18
  • [40] Unsupervised Ego-Motion and Dense Depth Estimation with Monocular Video
    Xu, Yufan
    Wang, Yan
    Guo, Lei
    2018 IEEE 18TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT), 2018, : 1306 - 1310