OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

Cited by: 0
Authors
Zhang, Hu [1 ]
Xu, Jianhua [4 ]
Tang, Tao [5 ]
Sun, Haiyang [6 ]
Yu, Xin [2 ]
Huang, Zi [2 ]
Yu, Kaicheng [3 ]
Affiliations
[1] CSIRO DATA61, Sydney, NSW, Australia
[2] Univ Queensland, Brisbane, Qld, Australia
[3] Westlake Univ, Hangzhou, Peoples R China
[4] Alibaba, DAMO Acad, Beijing, Peoples R China
[5] Sun Yat Sen Univ, Shenzhen Campus, Shenzhen, Peoples R China
[6] LiAuto Inc, Beijing, Peoples R China
Source
Keywords
OpenSight; Open-vocabulary; 3D object detection; VOXELNET;
DOI
10.1007/978-3-031-72907-2_1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional LiDAR-based object detection research focuses primarily on closed-set scenarios, which fall short in complex real-world applications. Directly transferring existing 2D open-vocabulary models, together with some known LiDAR classes, to obtain open-vocabulary ability tends to over-fit: the resulting model still detects the known objects even when presented with a novel category. In this paper, we propose OpenSight, a more advanced 2D-3D modeling framework for LiDAR-based open-vocabulary detection. OpenSight uses 2D-3D geometric priors to first discern and localize generic objects, and then interprets the specific semantics of the detected objects. The process begins by generating 2D boxes for generic objects from the camera images that accompany the LiDAR data. These 2D boxes, together with the LiDAR points, are then lifted back into LiDAR space to estimate the corresponding 3D boxes. For better generic object perception, the framework integrates both temporal and spatial awareness constraints. Temporal awareness correlates the predicted 3D boxes across consecutive timestamps, recalibrating missed or inaccurate boxes. Spatial awareness randomly places some "precisely" estimated 3D boxes at varying distances, increasing the visibility of generic objects. To interpret the specific semantics of detected objects, we develop a cross-modal alignment and fusion module that first aligns 3D features with 2D image embeddings and then fuses the aligned 3D-2D features for semantic decoding. Our experiments indicate that the method establishes state-of-the-art open-vocabulary performance on widely used 3D detection benchmarks and effectively identifies objects of new categories of interest.
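The lifting step mentioned in the abstract (2D boxes plus LiDAR points giving 3D boxes) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the points are already expressed in the camera frame, uses a plain pinhole model, and fits an axis-aligned box around every point that projects into the 2D box, with no background filtering. The function name `lift_box` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]         # (x, y, z) in the camera frame, z pointing forward
Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels
Box3D = Tuple[Point, Point]                # (min corner, max corner)


def lift_box(points: List[Point], box: Box2D,
             fx: float, fy: float, cx: float, cy: float) -> Optional[Box3D]:
    """Return an axis-aligned 3D box around the points that project into `box`.

    Assumption: a simple pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy); returns None when no point falls in the box.
    """
    u_min, v_min, u_max, v_max = box
    inside = []
    for x, y, z in points:
        if z <= 0.0:
            continue  # behind the image plane, not visible to the camera
        u = fx * x / z + cx  # pinhole projection to pixel coordinates
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            inside.append((x, y, z))
    if not inside:
        return None
    xs, ys, zs = zip(*inside)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Hypothetical toy data: four points, one 2D box
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (5.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
box3d = lift_box(points, (40.0, 40.0, 70.0, 60.0), fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

In practice, the points inside a 2D box also include background behind the object, so a real pipeline would need some way to separate the object from that clutter before fitting a box. The sketch leaves that out.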
Pages: 1-19
Page count: 19
Related papers
50 records in total
  • [21] On Onboard LiDAR-Based Flying Object Detection
    Vrba, Matous
    Walter, Viktor
    Pritzl, Vaclav
    Pliska, Michal
    Baca, Tomas
    Spurny, Vojtech
    Hert, Daniel
    Saska, Martin
    IEEE TRANSACTIONS ON ROBOTICS, 2025, 41 : 593 - 611
  • [22] Localized Vision-Language Matching for Open-vocabulary Object Detection
    Bravo, Maria A.
    Mittal, Sudhanshu
    Brox, Thomas
    PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 393 - 408
  • [23] Open-vocabulary Object Segmentation with Diffusion Models
    Li, Ziyi
    Zhou, Qinye
    Zhang, Xiaoyun
    Zhang, Ya
    Wang, Yanfeng
    Xie, Weidi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7633 - 7642
  • [24] Unsupervised Open-Vocabulary Object Localization in Videos
    Fan, Ke
    Bai, Zechen
    Xiao, Tianjun
    Zietlow, Dominik
    Horn, Max
    Zhao, Zixu
    Simon-Gabriel, Carl-Johann
    Shou, Mike Zheng
    Locatello, Francesco
    Schiele, Bernt
    Brox, Thomas
    Zhang, Zheng
    Fu, Yanwei
    He, Tong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13701 - 13709
  • [25] OVTrack: Open-Vocabulary Multiple Object Tracking
    Li, Siyuan
    Fischer, Tobias
    Ke, Lei
    Ding, Henghui
    Danelljan, Martin
    Yu, Fisher
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 5567 - 5577
  • [26] Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
    Du, Yu
    Wei, Fangyun
    Zhang, Zihe
    Shi, Miaojing
    Gao, Yue
    Li, Guoqi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 14064 - 14073
  • [27] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
    Kim, Dahun
    Angelova, Anelia
    Kuo, Weicheng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11144 - 11154
  • [28] YOLO-World: Real-Time Open-Vocabulary Object Detection
    Cheng, Tianheng
    Song, Lin
    Ge, Yixiao
    Liu, Wenyu
    Wang, Xinggang
    Shan, Ying
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 16901 - 16911
  • [29] VALO: A Versatile Anytime Framework for LiDAR-Based Object Detection Deep Neural Networks
    Soyyigit, Ahmet
    Yao, Shuochao
    Yun, Heechul
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2024, 43 (11) : 4045 - 4056
  • [30] Open-vocabulary object detection via debiased curriculum self-training
    Zhang, Hanlue
    Guan, Dayan
    Ke, Xiangrui
    El Saddik, Abdulmotaleb
    Lu, Shijian
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 255