DesCo: Learning Object Recognition with Rich Language Descriptions

Citations: 0
Authors
Li, Liunian Harold [1]
Dou, Zi-Yi [1]
Peng, Nanyun [1]
Chang, Kai-Wei [1]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and thus improve the models' adaptability to novel objects and domains. Recent studies have attempted to query these models with complex language expressions that include specifications of fine-grained details, such as colors, shapes, and relations. However, simply incorporating language descriptions into queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, a state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle this challenge, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions, consisting of two innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and to force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
Pages: 16