Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Cited: 0
Authors
Cong Pan [1 ,2 ]
Junran Peng [3 ]
Zhaoxiang Zhang [4 ,5 ,6 ,7 ]
Institutions
[1] Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA)
[2] School of Future Technology, University of Chinese Academy of Sciences (UCAS)
[3] Huawei Inc.
[4] IEEE
[5] Institute of Automation, Chinese Academy of Sciences (CASIA)
[6] University of Chinese Academy of Sciences (UCAS)
[7] Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (HKISI CAS)
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Number
TP391.41
Subject Classification Code
080203
Abstract
Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then apply LiDAR-based object detectors, or focus on image and depth fusion learning. However, they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex convolutional fusion schemes. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps to achieve more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them using cross-attention to exchange information with each other. Furthermore, with the help of pixel-wise relative depth values in depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
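The fusion step described in the abstract — queries from the RGB branch attending to keys and values from the depth branch — can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: NF-DVT uses learned query/key/value projections, multiple heads, and depth-conditioned relative position bias inside a Swin-Transformer backbone, whereas this sketch uses identity projections and a single head purely to show the information-exchange mechanism.

```python
import math

def matmul(a, b):
    """Naive (n x k) @ (k x m) matrix multiply over nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    mx = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - mx) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(rgb_tokens, depth_tokens, d):
    """Single-head cross-attention: RGB tokens act as queries, depth
    tokens as keys and values (identity projections for brevity).
    Returns depth-informed RGB tokens of the same shape."""
    # Scaled dot-product scores between every RGB and depth token.
    scores = matmul(rgb_tokens, transpose(depth_tokens))
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # Each output token is a convex combination of depth tokens.
    return matmul(attn, depth_tokens)
```

In the full model this exchange runs in both directions (RGB-to-depth and depth-to-RGB) and the attention logits additionally receive a relative position bias derived from pixel-wise relative depth values, which the identity-projection sketch above omits.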
Pages: 673-689
Page count: 17
Related Papers
50 items in total
  • [1] Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection
    Pan, Cong
    Peng, Junran
    Zhang, Zhaoxiang
    IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2024, 11 (03) : 673 - 689
  • [2] MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
    Zhang, Renrui
    Qiu, Han
    Wang, Tai
    Guo, Ziyu
    Cui, Ziteng
    Qiao, Yu
    Li, Hongsheng
    Gao, Peng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 9121 - 9132
  • [3] Learning Depth-Guided Convolutions for Monocular 3D Object Detection
    Ding, Mingyu
    Huo, Yuqi
    Yi, Hongwei
    Wang, Zhe
    Shi, Jianping
    Lu, Zhiwu
    Luo, Ping
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4306 - 4315
  • [4] Revisiting Depth-guided Methods for Monocular 3D Object Detection by Hierarchical Balanced Depth
    Chen, Yi-Rong
    Tseng, Ching-Yu
    Liou, Yi-Syuan
    Wu, Tsung-Han
    Hsu, Winston H.
    CONFERENCE ON ROBOT LEARNING, VOL 229, 2023, 229
  • [5] CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection
    Tseng, Ching-Yu
    Chen, Yi-Rong
    Lee, Hsin-Ying
    Wu, Tsung-Han
    Chen, Wen-Chin
    Hsu, Winston H.
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA, 2023, : 4850 - 4857
  • [6] MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
    Huang, Kuan-Chih
    Wu, Tsung-Han
    Su, Hung-Ting
    Hsu, Winston H.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 4002 - 4011
  • [7] Monocular 3-D Object Detection Based on Depth-Guided Local Convolution for Smart Payment in D2D Systems
    Li, Jun
    Song, Wei
    Gao, Yongbin
    Wang, Huixing
    Yan, Yier
    Huang, Bo
    Zhang, Jun
    Wang, Wei
    IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (03) : 2245 - 2254
  • [8] OBJECT-AWARE CALIBRATED DEPTH-GUIDED TRANSFORMER FOR RGB-D CO-SALIENT OBJECT DETECTION
    Wu, Yang
    Liang, Lingyan
    Zhao, Yaqian
    Zhang, Kaihua
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1121 - 1126
  • [9] Monocular 3D Object Detection with Depth from Motion
    Wang, Tai
    Pang, Jiangmiao
    Lin, Dahua
    COMPUTER VISION, ECCV 2022, PT IX, 2022, 13669 : 386 - 403
  • [10] Depth-Guided Progressive Network for Object Detection
    Ma, Jia-Wei
    Liang, Min
    Chen, Song-Lu
    Chen, Feng
    Tian, Shu
    Qin, Jingyan
    Yin, Xu-Cheng
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (10) : 19523 - 19533