DesCo: Learning Object Recognition with Rich Language Descriptions

Authors
Li, Liunian Harold [1]
Dou, Zi-Yi [1]
Peng, Nanyun [1]
Chang, Kai-Wei [1]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and thus improve the models' adaptability to novel objects and domains. Recent studies have attempted to query these models with complex language expressions that specify fine-grained details such as colors, shapes, and relations. However, simply incorporating language descriptions into queries does not guarantee that the models interpret them accurately. In fact, our experiments show that GLIP, a state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle this challenge, we propose a new description-conditioned (DesCo) paradigm for learning object recognition models with rich language descriptions, built on two innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and to force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr on minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
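The idea of a context-sensitive query can be illustrated with a small sketch. It is not the authors' implementation: the helper names (`Query`, `describe`, `build_context_sensitive_queries`), the query template, and the hard-coded descriptions are all assumptions made for illustration, with the descriptions standing in for text a large language model might generate. The sketch shows only the general principle that, when every candidate query shares the same object name, a detector has to read the description to pick the right one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Query:
    """A language query paired with a flag saying whether it matches the object."""
    text: str
    is_positive: bool


def describe(name: str, description: str) -> str:
    # Hypothetical template: object name followed by its description.
    return f"{name}, {description}"


def build_context_sensitive_queries(
    name: str, positive: str, negatives: List[str]
) -> List[Query]:
    """Build one matching query and several non-matching ones.

    All queries share the same object name, so the name alone cannot
    separate the positive query from the negatives.
    """
    queries = [Query(describe(name, positive), True)]
    queries += [Query(describe(name, d), False) for d in negatives]
    return queries


# Hard-coded stand-ins for descriptions a language model might produce.
queries = build_context_sensitive_queries(
    name="mallet",
    positive="a kind of tool, wooden handle with a round head",
    negatives=["a kind of tool, long metal handle with a sharp blade"],
)

for q in queries:
    print(q.is_positive, q.text)
```

Because the name "mallet" appears in both queries, a model that ignores the description would score them identically; only the description distinguishes the match from the non-match.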
Pages: 16