DesCo: Learning Object Recognition with Rich Language Descriptions

Cited by: 0
Authors
Li, Liunian Harold [1 ]
Dou, Zi-Yi [1 ]
Peng, Nanyun [1 ]
Chang, Kai-Wei [1 ]
Institutions
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and thus improve the models' adaptability to novel objects and domains. Recent studies have attempted to query these models with complex language expressions that include specifications of fine-grained details, such as colors, shapes, and relations. However, simply incorporating language descriptions into queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, a state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle this challenge, we propose a new description-conditioned (DesCo) paradigm for learning object recognition models with rich language descriptions, consisting of two innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
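As a rough illustration of the two components named in the abstract (LLM-generated object descriptions and context-sensitive queries), the sketch below assumes a generic text-in/text-out language-model callable; the prompt wording and helper names are illustrative assumptions, not code from the DesCo release.

```python
# Minimal sketch of DesCo-style query construction, under the assumptions
# stated above. `llm` is any callable that maps a prompt string to text;
# the prompt and helpers are hypothetical, not the authors' implementation.

def describe_object(llm, category: str) -> str:
    """Ask a language model for a rich, attribute-level description
    of an object category (color, shape, distinguishing parts)."""
    prompt = (
        f"Describe the visual appearance of a '{category}' in one sentence, "
        "mentioning its typical color, shape, and distinguishing parts."
    )
    return llm(prompt)


def build_context_sensitive_query(target: str, description: str,
                                  confusable: list[str]) -> str:
    """Pair the target's description with confusable categories so a
    grounding detector must read the context rather than match the
    object name alone."""
    negatives = ". ".join(confusable)
    return f"{target}, which is {description} Other objects: {negatives}."


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for a real language model, used only for this demo.
        return ("a small, round citrus fruit with bright orange, "
                "dimpled skin.")

    query = build_context_sensitive_query(
        "orange",
        describe_object(fake_llm, "orange"),
        confusable=["lemon", "apple", "ball"],
    )
    print(query)  # this text query would then be fed to a detector such as GLIP or FIBER
```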
Pages: 16