Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

Cited by: 0
Authors
Xiao, Baihui [1 ]
Xu, Jingzehua [1 ]
Zhang, Zekai [1 ]
Xing, Tianyu [1 ]
Wang, Jingjing [2 ]
Ren, Yong [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT II | 2024, Vol. 15017
Funding
National Natural Science Foundation of China
Keywords
Frame camera; Event camera; Multi-modal fusion; Transformer self-attention; Monocular depth estimation; Vision
DOI
10.1007/978-3-031-72335-3_29
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Frame cameras struggle to estimate depth maps accurately under abnormal lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, capture sparse, asynchronous event streams that record per-pixel brightness changes, addressing the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders the ability of event cameras to predict dense depth maps effectively. Integrating event streams with frame data can significantly improve monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that fuses event and frame data using a transformer-based model. The proposed framework comprises two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention to model the interactions between frame patches and event tensors, capturing dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module, which extracts contextual information and delivers detailed depth predictions. Experimental results on the EventScape and MVSEC datasets confirm that our method sets a new performance benchmark.
Pages: 419-433
Page count: 15
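
The abstract describes a multimodal encoder that applies joint self-attention across frame patches and event tensors. As a rough, hypothetical sketch of that fusion idea (not the authors' implementation: the voxel-grid event representation, the module name, and all dimensions below are assumptions made for illustration), a minimal PyTorch version could look like:

# Sketch of joint self-attention over frame and event tokens.
# Illustrates the fusion idea from the abstract; NOT the paper's
# architecture. Event input is assumed to be a voxel grid with
# `event_bins` temporal channels.
import torch
import torch.nn as nn

class EventFrameFusionEncoder(nn.Module):
    def __init__(self, dim=128, patch=16, event_bins=5, heads=4, layers=2):
        super().__init__()
        # Patch embeddings: frames are 3-channel images; events are
        # voxelized into `event_bins` channels (an assumed representation).
        self.frame_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.event_embed = nn.Conv2d(event_bins, dim, kernel_size=patch, stride=patch)
        # Learned modality embeddings so attention can tell tokens apart.
        self.frame_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.event_type = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, frame, events):
        # frame: (B, 3, H, W); events: (B, event_bins, H, W)
        f = self.frame_embed(frame).flatten(2).transpose(1, 2)   # (B, Nf, dim)
        e = self.event_embed(events).flatten(2).transpose(1, 2)  # (B, Ne, dim)
        tokens = torch.cat([f + self.frame_type, e + self.event_type], dim=1)
        # Joint self-attention: every frame patch can attend to every
        # event token and vice versa (cross-modal fusion).
        return self.encoder(tokens)

if __name__ == "__main__":
    model = EventFrameFusionEncoder()
    fused = model(torch.randn(1, 3, 256, 256), torch.randn(1, 5, 256, 256))
    print(fused.shape)  # (1, 512, 128): 256 frame + 256 event tokens

Concatenating both token sets before self-attention is one common way to realize the cross-modal dependency modeling the abstract describes; the paper's multi-scale fusion and dual-phase, triple-scale decoder are not reproduced here.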
相关论文
共 50 条
  • [31] Event-Based Monocular Depth Estimation With Recurrent Transformers
    Liu, Xu
    Li, Jianing
    Shi, Jinqiao
    Fan, Xiaopeng
    Tian, Yonghong
    Zhao, Debin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 7417 - 7429
  • [32] Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
    Wang, Xiang
    Luo, Haonan
    Wang, Zihang
    Zheng, Jin
    Bai, Xiao
    INFORMATION FUSION, 2024, 108
  • [33] Monocular Depth Estimation Based on Multi-Scale Depth Map Fusion
    Yang, Xin
    Chang, Qingling
    Liu, Xinglin
    He, Siyuan
    Cui, Yan
    IEEE ACCESS, 2021, 9 : 67696 - 67705
  • [34] Novel Hybrid Neural Network for Dense Depth Estimation using On-Board Monocular Images
    Jia, Shaocheng
    Pei, Xin
    Yang, Zi
    Tian, Shan
    Yue, Yun
    TRANSPORTATION RESEARCH RECORD, 2020, 2674 (12) : 312 - 323
  • [35] Depth estimation of monocular video using non-parametric fusion of multiple cues
    Liu, Tianliang
    Mo, Yiming
    Xu, Gaobang
    Dai, Xiubin
    Zhu, Xiuchang
    Luo, Jiebo
    Dongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Southeast University (Natural Science Edition), 2015, 45 (05): : 834 - 839
  • [36] Lightweight Monocular Depth Estimation via Token-Sharing Transformer
    Lee, Dong-Jae
    Lee, Jae Young
    Shon, Hyunguk
    Yi, Eojindl
    Park, Yeong-Hun
    Cho, Sung-Sik
    Kim, Junmo
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA, 2023, : 4895 - 4901
  • [37] Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy
    Liu, Xingtong
    Sinha, Ayushi
    Unberath, Mathias
    Ishii, Masaru
    Hager, Gregory D.
    Taylor, Russell H.
    Reiter, Austin
    OR 2.0 CONTEXT-AWARE OPERATING THEATERS, COMPUTER ASSISTED ROBOTIC ENDOSCOPY, CLINICAL IMAGE-BASED PROCEDURES, AND SKIN IMAGE ANALYSIS, OR 2.0 2018, 2018, 11041 : 128 - 138
  • [38] MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Zhao, Chaoqiang
    Zhang, Youmin
    Poggi, Matteo
    Tosi, Fabio
    Guo, Xianda
    Zhu, Zheng
    Huang, Guan
    Tang, Yang
    Mattoccia, Stefano
    2022 INTERNATIONAL CONFERENCE ON 3D VISION, 3DV, 2022, : 668 - 678
  • [39] Underwater Monocular Depth Estimation Based on Physical-Guided Transformer
    Wang, Chen
    Xu, Haiyong
    Jiang, Gangyi
    Yu, Mei
    Luo, Ting
    Chen, Yeyao
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 18 - 18
  • [40] Unsupervised Ego-Motion and Dense Depth Estimation with Monocular Video
    Xu, Yufan
    Wang, Yan
    Guo, Lei
    2018 IEEE 18TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT), 2018, : 1306 - 1310