Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding

Cited: 0
Authors:
Bakr, Eslam Mohamed [1]
Alsaedy, Yasmeen [1]
Elhoseiny, Mohamed [1]
Affiliations:
[1] King Abdullah Univ Sci & Technol (KAUST), Thuwal, Saudi Arabia
Keywords: (not listed)
DOI: Not available
CLC number: TP18 [Artificial intelligence theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
The 3D visual grounding task has been explored with visual and language streams that comprehend referential language to identify target objects in 3D scenes. However, most existing methods devote the visual stream to capturing 3D visual clues using off-the-shelf point cloud encoders. The main question we address in this paper is: "Can we consolidate the 3D visual stream with 2D clues synthesized from point clouds and efficiently utilize them in training and testing?" The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues synthetically generated from 3D point clouds and empirically show their ability to boost the quality of the learned visual representations. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer datasets and show consistent performance gains over existing methods. Our proposed module, dubbed Look Around and Refer (LAR), significantly outperforms state-of-the-art 3D visual grounding techniques on all three benchmarks. The code is available at https://eslambakr.github.io/LAR.github.io/.
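The abstract describes the approach only at a high level. As an illustration of the central idea, the sketch below (an assumption, not the authors' LAR implementation) shows one simple way to synthesize 2D views directly from a colored point cloud: place virtual cameras around an object and splat its points through a pinhole projection. The function name render_view, the camera placement, and all parameter values are hypothetical.

```python
# Minimal sketch of synthesizing 2D views from a colored point cloud.
# Illustrative only; names and parameters are assumptions, not the LAR code.
import numpy as np

def render_view(points, colors, azimuth_deg, elevation_deg=30.0,
                distance=2.0, img_size=128, focal=120.0):
    """Project an (N, 3) colored point cloud into one synthetic RGB view."""
    center = points.mean(axis=0)
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Place a virtual camera on a sphere around the object centroid.
    cam = center + distance * np.array([np.cos(el) * np.cos(az),
                                        np.cos(el) * np.sin(az),
                                        np.sin(el)])
    # Look-at rotation: forward, right, and up axes of the camera.
    fwd = center - cam
    fwd /= np.linalg.norm(fwd)
    right = np.cross(fwd, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, fwd)
    R = np.stack([right, up, fwd])            # world -> camera rotation
    pc = (points - cam) @ R.T                 # points in camera coordinates
    keep = pc[:, 2] > 1e-3                    # only points in front of the camera
    pc, col = pc[keep], colors[keep]
    # Pinhole projection to pixel coordinates.
    u = (focal * pc[:, 0] / pc[:, 2] + img_size / 2).astype(int)
    v = (focal * pc[:, 1] / pc[:, 2] + img_size / 2).astype(int)
    inside = (u >= 0) & (u < img_size) & (v >= 0) & (v < img_size)
    u, v, z, col = u[inside], v[inside], pc[inside, 2], col[inside]
    # Painter's-style splatting: write far points first so near points overwrite them.
    order = np.argsort(-z)
    img = np.zeros((img_size, img_size, 3), dtype=np.float32)
    img[v[order], u[order]] = col[order]
    return img

# Usage: render several views "around" one object's points.
pts = np.random.rand(2048, 3)   # placeholder object point cloud (x, y, z)
rgb = np.random.rand(2048, 3)   # placeholder per-point colors in [0, 1]
views = [render_view(pts, rgb, az) for az in (0, 90, 180, 270)]
```

A full system would presumably encode several such views per object with a 2D backbone and use those features to assist the 3D stream, which is the kind of assistance the abstract describes, without requiring any extra 2D inputs beyond the point clouds themselves.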
Pages: 13
Related Papers (50 in total)
  • [1] SAT: 2D Semantics Assisted Training for 3D Visual Grounding
    Yang, Zhengyuan
    Zhang, Songyang
    Wang, Liwei
    Luo, Jiebo
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1836 - 1846
  • [2] 2D to 3D Magnetism in Synthetic Micas
    Rosas-Huerta, Jose Luis
    Wolber, Jonas
    Minaud, Claire
    Fabelo, Oscar
    Ritter, Clemens
    Mentre, Olivier
    Arevalo-Lopez, Angel M.
    ADVANCED SCIENCE, 2024, 11 (42)
  • [3] The synthetic collimator for 2D and 3D imaging
    Clarkson, E
    Wilson, DW
    Barrett, HH
    MEDICAL IMAGING 1999: PHYSICS OF MEDICAL IMAGING, PTS 1 AND 2, 1999, 3659 : 107 - 117
  • [4] 3D RECONSTRUCTION OF A 2D VISUAL DISPLAY
    BROWN, LB
    JOURNAL OF GENETIC PSYCHOLOGY, 1969, 115 (02): 257 - &
  • [5] Understanding Pixel-Level 2D Image Semantics With 3D Keypoint Knowledge Engine
    You, Yang
    Li, Chengkun
    Lou, Yujing
    Cheng, Zhoujun
    Li, Liangwei
    Ma, Lizhuang
    Wang, Weiming
    Lu, Cewu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (09) : 5780 - 5795
  • [6] Is the deformation around the MCT, 2D or 3D deformation?
    Hayashi, Daigoro
    JOURNAL OF HIMALAYAN EARTH SCIENCES, 2011, 44 (01): 27 - 28
  • [7] Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
    Liu, Haolin
    Lin, Anran
    Han, Xiaoguang
    Yang, Lei
    Yu, Yizhou
    Cui, Shuguang
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 6028 - 6037
  • [8] Combining 2D to 2D and 3D to 2D Point Correspondences for Stereo Visual Odometry
    Manthe, Stephan
    Carrio, Adrian
    Neuhaus, Frank
    Campoy, Pascual
    Paulus, Dietrich
    PROCEEDINGS OF THE 13TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISIGRAPP 2018), VOL 5: VISAPP, 2018, : 455 - 463
  • [9] Grounding 3D Object Affordance from 2D Interactions in Images
    Yang, Yuhang
    Zhai, Wei
    Luo, Hongchen
    Cao, Yang
    Luo, Jiebo
    Zha, Zheng-Jun
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 10871 - 10881
  • [10] ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding
    Guo, Zoey
    Tang, Yiwen
    Zhang, Ray
    Wang, Dong
    Wang, Zhigang
    Zhao, Bin
    Li, Xuelong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15326 - 15337