OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding

Cited by: 0
Authors
Liao, Guibiao [1 ]
Zhou, Kaichen [2 ]
Bao, Zhenyu [1 ]
Liu, Kanglin [3 ]
Li, Qing [3 ]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Shenzhen 518055, Peoples R China
[2] Univ Oxford, Dept Comp Sci, Oxford OX1 2JD, Oxfordshire, England
[3] Pengcheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Three-dimensional displays; Neural radiance field; Training; Solid modeling; Rendering (computer graphics); Circuits and systems; open-vocabulary; vision and language foundation models; cross-view self-enhancement;
DOI
10.1109/TCSVT.2024.3439737
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to the noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization, which leverages 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge posed by view-inconsistent semantics. Rather than invariably using the 2D inconsistent semantics from CLIP, CSE leverages the 3D-consistent semantics generated by the well-trained semantic field itself for semantic field training, reducing ambiguity and enhancing overall semantic consistency across different views. Extensive experiments validate that OV-NeRF outperforms current state-of-the-art methods, achieving significant improvements of 20.31% and 18.42% in mIoU on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistently superior results across various CLIP configurations, further verifying its robustness. Code is available at: https://github.com/pcl3dv/OV-NeRF.
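The abstract describes two mechanisms: RSR, which uses SAM mask proposals to clean up noisy per-pixel CLIP semantics within each training view, and CSE, which supervises the field with 3D-consistent semantics rendered from the field itself instead of view-inconsistent CLIP maps. The NumPy sketch below only illustrates these two ideas; the function names, signatures, and aggregation choices (rectify_view_semantics, mean-relevancy ranking, a plain pseudo-label cross-entropy) are assumptions for illustration, not the authors' implementation, which is available in the linked repository.

import numpy as np

def rectify_view_semantics(clip_logits, sam_masks):
    """Region Semantic Ranking (RSR) idea, sketched.
    clip_logits: (H, W, C) per-pixel CLIP relevancy scores for C text queries.
    sam_masks:   list of (H, W) boolean SAM mask proposals for the same view.
    Each SAM region is assigned the class whose mean relevancy ranks highest
    inside it, so all pixels of one object proposal share a single label."""
    rectified = clip_logits.argmax(axis=-1)             # fallback: raw CLIP labels
    for mask in sam_masks:
        if not mask.any():
            continue
        region_scores = clip_logits[mask].mean(axis=0)  # (C,) mean relevancy per class
        rectified[mask] = int(region_scores.argmax())   # top-ranked class wins the region
    return rectified                                    # (H, W) rectified class map

def cse_pseudo_label_loss(rendered_logits, pseudo_labels):
    """Cross-view Self-enhancement (CSE) idea, sketched as a pseudo-label loss.
    rendered_logits: (H, W, C) semantic logits rendered from the current field.
    pseudo_labels:   (H, W) class map previously rendered from the (partially
                     trained) field itself; being rendered from one 3D field, it
                     is view-consistent, unlike per-view CLIP supervision."""
    h, w, c = rendered_logits.shape
    logits = rendered_logits.reshape(-1, c)
    labels = pseudo_labels.reshape(-1).astype(np.int64)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(labels.size), labels].mean())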
Pages: 12923-12936
Page count: 14