A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Cited by: 2
Authors
Zhu, Chaoyang [1]
Chen, Long [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Kowloon, Hong Kong, Peoples R China
Keywords
Open-vocabulary; zero-shot learning; object detection; image segmentation; future directions; OBJECT; LANGUAGE
DOI
10.1109/TPAMI.2024.3413013
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, so state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years, the community has paid increasing attention to Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D and video understanding. The main design principles, key challenges, development routes, and methodology strengths and weaknesses are thoroughly analyzed.
Pages: 8954-8975
Number of pages: 22