Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

Cited by: 0
Authors
Xiao, Baihui [1 ]
Xu, Jingzehua [1 ]
Zhang, Zekai [1 ]
Xing, Tianyu [1 ]
Wang, Jingjing [2 ]
Ren, Yong [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Frame Camera; Event Camera; Multi-modal Fusion; Transformer self-attention; Monocular depth estimation; VISION;
DOI
10.1007/978-3-031-72335-3_29
CLC classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Frame cameras struggle to estimate depth maps accurately under abnormal lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, capture sparse, asynchronous event streams that record per-pixel brightness changes, addressing the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders event cameras from predicting dense depth maps effectively. Integrating event streams with frame data can significantly enhance monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that fuses event and frame data with a transformer-based model. The proposed framework contains two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention to model the interactions between frame patches and event tensors, capturing dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module that extracts contextual information and delivers detailed depth predictions. Experimental results on the EventScape and MVSEC datasets confirm that our method sets a new performance benchmark.
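The cross-modal fusion the abstract describes — frame patches and event tensors embedded as tokens and attended jointly so each token can draw on both modalities — can be sketched minimally as plain scaled dot-product self-attention over the concatenated token sequence. This is an illustrative sketch only, not the paper's actual encoder: the token shapes, identity Q/K/V projections, and toy inputs below are assumptions for demonstration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Each output token is a convex combination of ALL input tokens, so
    frame-derived and event-derived tokens mix in a single pass.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [dot(q, k) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        fused = [sum(w * v[i] for w, v in zip(weights, tokens))
                 for i in range(d)]
        out.append(fused)
    return out

# Hypothetical toy inputs: two frame-patch tokens and two event-tensor
# tokens, already projected into a shared 4-dim embedding space.
frame_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
event_tokens = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

fused = self_attention(frame_tokens + event_tokens)
print(len(fused), len(fused[0]))  # one fused token per input token
```

In the paper's framework this joint attention is applied at multiple scales; the sketch collapses that to a single layer to show only the mechanism by which event and frame tokens exchange information.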
Pages: 419 - 433
Page count: 15
Related papers
50 records in total
  • [1] Monocular Dense Reconstruction by Depth Estimation Fusion
    Chen, Tian
    Ding, Wendong
    Zhang, Dapeng
    Liu, Xilong
    PROCEEDINGS OF THE 30TH CHINESE CONTROL AND DECISION CONFERENCE (2018 CCDC), 2018, : 4460 - 4465
  • [2] Lightweight monocular depth estimation using a fusion-improved transformer
    Sui, Xin
    Gao, Song
    Xu, Aigong
    Zhang, Cong
    Wang, Changqiang
    Shi, Zhengxu
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [3] GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation
    Fang, Naiyu
    Qiu, Lemiao
    Zhang, Shuyou
    Wang, Zili
    Zhou, Zheyuan
    Hu, Kerui
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (03) : 2256 - 2263
  • [4] Unsupervised Monocular Depth Estimation Based on Dense Feature Fusion
    Chen Ying
    Wang Yiliang
    JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2021, 43 (10) : 2976 - 2984
  • [5] Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion
    Xia, Zhongyi
    Wu, Tianzhao
    Wang, Zhuoyan
    Zhou, Man
    Wu, Boqi
    Chan, C. Y.
    Kong, Ling Bing
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [6] Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation
    Yang, Wei-Jong
    Wu, Chih-Chen
    Yang, Jar-Ferr
    SENSORS, 2025, 25 (01)
  • [7] A Contour-Aware Monocular Depth Estimation Network Using Swin Transformer and Cascaded Multiscale Fusion
    Li, Tao
    Zhang, Yi
    IEEE SENSORS JOURNAL, 2024, 24 (08) : 13620 - 13628
  • [8] Monocular depth estimation based on dense connections
    Wang, Quande
    Cheng, Kai
    JOURNAL OF HUAZHONG UNIVERSITY OF SCIENCE AND TECHNOLOGY (NATURAL SCIENCE EDITION), 2023, 51 (11): : 75 - 82
  • [9] DEPTHFORMER: MULTISCALE VISION TRANSFORMER FOR MONOCULAR DEPTH ESTIMATION WITH GLOBAL LOCAL INFORMATION FUSION
    Agarwal, Ashutosh
    Arora, Chetan
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 3873 - 3877