Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Cited by: 0
Authors
Cong Pan [1, 2]
Junran Peng [3]
Zhaoxiang Zhang [4, 5, 6, 7]
Affiliations
[1] Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA)
[2] School of Future Technology, University of Chinese Academy of Sciences (UCAS)
[3] Huawei Inc.
[4] IEEE
[5] Institute of Automation, Chinese Academy of Sciences (CASIA)
[6] University of Chinese Academy of Sciences (UCAS)
[7] Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (HKISI CAS)
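Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP391.41
Discipline classification code
080203
Abstract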
Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and use them as an additional input to augment the RGB images. Such depth-based methods either convert the estimated depth maps to pseudo-LiDAR and apply LiDAR-based object detectors, or focus on image and depth fusion learning. However, they show limited performance and efficiency because of depth inaccuracy and complex convolution-based fusion. In contrast to these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) uses normalizing flows to build priors in depth maps and obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them with cross-attention so that the branches exchange information. Furthermore, using the pixel-wise relative depth values in the depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture a more accurate sequence ordering of the input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
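The fusion idea in the abstract, where RGB tokens attend to depth tokens and the attention is adjusted by relative depth, can be illustrated with a small sketch. This is not the NF-DVT implementation: the function name, the single-head layout, the random weights, and especially the bias term `-alpha * |d_i - d_j|` are assumptions chosen for illustration. The abstract does not specify the form of the relative position embeddings, and the sketch omits Swin-style windowing and the normalizing-flow depth prior.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_biased_cross_attention(rgb_tokens, depth_tokens, rgb_depth,
                                 depth_depth, w_q, w_k, w_v, alpha=1.0):
    """Single-head cross-attention: RGB queries attend to depth keys/values.

    Hypothetical bias (an assumption, not the paper's formulation): the
    attention logit between tokens i and j is reduced by
    alpha * |depth_i - depth_j|, so patches at similar depth attend to
    each other more strongly.
    """
    q = rgb_tokens @ w_q                      # (n_rgb, dim)
    k = depth_tokens @ w_k                    # (n_depth, dim)
    v = depth_tokens @ w_v                    # (n_depth, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product logits
    rel = np.abs(rgb_depth[:, None] - depth_depth[None, :])
    weights = softmax(scores - alpha * rel, axis=-1)
    return weights @ v, weights


# Toy inputs: 4 patches per branch, 8-dimensional tokens.
rng = np.random.default_rng(0)
n, dim = 4, 8
rgb_tokens = rng.normal(size=(n, dim))
depth_tokens = rng.normal(size=(n, dim))
patch_depth = np.array([2.0, 2.5, 10.0, 30.0])  # mean depth per patch (made up)
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

fused, weights = depth_biased_cross_attention(
    rgb_tokens, depth_tokens, patch_depth, patch_depth, w_q, w_k, w_v
)
```

In this sketch `fused` holds one depth-informed vector per RGB patch. A symmetric call with the roles of the two branches swapped would give the depth branch access to RGB information, which is one way to read "exchange information with each other".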
Pages: 673-689
Number of pages: 17