LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Cited by: 0
Authors
Du, Penghui [1, 2, 3]
Wang, Yu [2]
Sun, Yifan [2]
Wang, Luting [1]
Liao, Yue [1]
Zhang, Gang [2]
Ding, Errui [2]
Wang, Yan [3]
Wang, Jingdong [2]
Liu, Si [1]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Baidu, Beijing, Peoples R China
[3] Tsinghua Univ, AIR, Beijing, Peoples R China
Source
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Inter-category Relationships; Language Model; DETR;
DOI
10.1007/978-3-031-73337-6_18
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge: (1) A deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge. (2) An overfitting tendency towards base categories, with the open vocabulary knowledge biased towards base categories during the transfer from VLMs to detectors. To address these challenges, we propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector, termed LaMI-DETR. LaMI utilizes GPT to construct visual concepts and employs T5 to investigate visual similarities across categories. These inter-category relationships refine concept representation and avoid overfitting to base categories. Comprehensive experiments validate our approach's superior performance over existing methods in the same rigorous setting without reliance on external training resources. LaMI-DETR achieves a rare box AP of 43.4 on OV-LVIS, surpassing the previous best by 7.8 rare box AP.
Pages: 312-328
Page count: 17