Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Cited by: 0
Authors
Cong Pan [1 ,2 ]
Junran Peng [3 ]
Zhaoxiang Zhang [4 ,5 ,6 ,7 ]
Affiliations
[1] Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA)
[2] School of Future Technology, University of Chinese Academy of Sciences (UCAS)
[3] Huawei Inc.
[4] IEEE
[5] Institute of Automation, Chinese Academy of Sciences (CASIA)
[6] University of Chinese Academy of Sciences (UCAS)
[7] Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (HKISI CAS)
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Number
TP391.41
Subject Classification Code
080203
Abstract
Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and use them as an additional input to augment the RGB images. Depth-based methods either convert the estimated depth maps to pseudo-LiDAR and apply LiDAR-based object detectors, or focus on image-depth fusion learning. However, they show limited performance and efficiency as a result of depth inaccuracy and complex convolutional fusion schemes. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors on depth maps and thereby obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them with cross-attention so that the branches exchange information with each other. Furthermore, with the help of the pixel-wise relative depth values in the depth maps, we develop new relative position embeddings for the cross-attention mechanism to capture more accurate sequence ordering of the input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of the proposed method and its superior performance over previous counterparts.
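The cross-attention fusion described in the abstract (RGB patch tokens attending to depth patch tokens) can be sketched roughly as follows. This is a minimal, dependency-free illustration in plain Python — single head, no learned projections, no relative position embeddings — not the authors' NF-DVT implementation; token dimensions and values are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(rgb_tokens, depth_tokens):
    """Queries come from the RGB branch; keys and values come from the
    depth branch, so each RGB patch attends to every depth patch."""
    d_k = len(depth_tokens[0])
    out = []
    for q in rgb_tokens:
        # scaled dot-product scores against all depth tokens
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in depth_tokens]
        w = softmax(scores)
        # attention output: weighted sum of depth-token values
        out.append([sum(wi * v[j] for wi, v in zip(w, depth_tokens))
                    for j in range(d_k)])
    return out

rgb = [[0.1, 0.2], [0.3, -0.1]]                 # 2 RGB patch tokens, dim 2
depth = [[0.5, 0.0], [-0.2, 0.4], [0.1, 0.1]]   # 3 depth patch tokens, dim 2
out = cross_attention(rgb, depth)
print(len(out), len(out[0]))  # 2 2
```

In the paper's bidirectional setting this would be applied in both directions (RGB→depth and depth→RGB) with learned query/key/value projections inside the Swin-based fusion module.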
Pages: 673-689
Page count: 17
Related Papers (50 records in total)
  • [31] Task-Aware Monocular Depth Estimation for 3D Object Detection
    Wang, Xinlong
    Yin, Wei
    Kong, Tao
    Jiang, Yuning
    Li, Lei
    Shen, Chunhua
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 12257 - 12264
  • [32] Bridged Transformer for Vision and Point Cloud 3D Object Detection
    Wang, Yikai
    Ye, TengQi
    Cao, Lele
    Huang, Wenbing
    Sun, Fuchun
    He, Fengxiang
    Tao, Dacheng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12104 - 12113
  • [33] Boosting Monocular 3D Object Detection With Object-Centric Auxiliary Depth Supervision
    Kim, Youngseok
    Kim, Sanmin
    Sim, Sangmin
    Choi, Jun Won
    Kum, Dongsuk
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2023, 24 (02) : 1801 - 1813
  • [34] Geometry-Guided Domain Generalization for Monocular 3D Object Detection
    Yang, Fan
    Chen, Hui
    He, Yuwei
    Zhao, Sicheng
    Zhang, Chenghao
    Ni, Kai
    Ding, Guiguang
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 6467 - 6476
  • [35] Aerial Monocular 3D Object Detection
    Hu, Yue
    Fang, Shaoheng
    Xie, Weidi
    Chen, Siheng
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (04) : 1959 - 1966
  • [36] Disentangling Monocular 3D Object Detection
    Simonelli, Andrea
    Bulo, Samuel Rota
    Porzi, Lorenzo
    Lopez-Antequera, Manuel
    Kontschieder, Peter
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 1991 - 1999
  • [37] DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection
    Peng, Liang
    Wu, Xiaopei
    Yang, Zheng
    Liu, Haifeng
    Cai, Deng
    COMPUTER VISION - ECCV 2022, PT I, 2022, 13661 : 71 - 88
  • [38] FCOS3Dformer: enhancing monocular 3D object detection through transformer-assisted fusion of depth information
    Hao, Bingsen
    Deng, Zhaoxue
    Liu, Mingze
    Liu, Can
    INTERNATIONAL JOURNAL OF VEHICLE SYSTEMS MODELLING AND TESTING, 2024, 18 (03) : 228 - 244
  • [39] MoGDE: Boosting Mobile Monocular 3D Object Detection with Ground Depth Estimation
    Zhou, Yunsong
    Liu, Quan
    Zhu, Hongzi
    Li, Yunzhe
    Chang, Shan
    Guo, Minyi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [40] Monocular 3D Object Detection With Sequential Feature Association and Depth Hint Augmentation
    Gao, Tianze
    Pan, Huihui
    Gao, Huijun
    IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2022, 7 (02): : 240 - 250