DesCo: Learning Object Recognition with Rich Language Descriptions

Cited by: 0
Authors
Li, Liunian Harold [1 ]
Dou, Zi-Yi [1 ]
Peng, Nanyun [1 ]
Chang, Kai-Wei [1 ]
Institutions
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and thus improve the models' adaptability to novel objects and domains. Recent studies have attempted to query these models with complex language expressions that include specifications of fine-grained details, such as colors, shapes, and relations. However, simply incorporating language descriptions into queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, a state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle this challenge, we propose a new description-conditioned (DesCo) paradigm for learning object recognition models with rich language descriptions, consisting of two innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
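As a rough illustration of the two components named in the abstract (LLM-generated object descriptions and context-sensitive queries), the sketch below assumes a generic text-in/text-out language-model callable; the prompt wording and helper names are illustrative assumptions, not code from the DesCo release.

```python
# Minimal sketch of DesCo-style query construction, under the assumptions
# stated above. `llm` is any callable that maps a prompt string to text;
# the prompt and helpers are hypothetical, not the authors' implementation.

def describe_object(llm, category: str) -> str:
    """Ask a language model for a rich, attribute-level description
    of an object category (color, shape, distinguishing parts)."""
    prompt = (
        f"Describe the visual appearance of a '{category}' in one sentence, "
        "mentioning its typical color, shape, and distinguishing parts."
    )
    return llm(prompt)


def build_context_sensitive_query(target: str, description: str,
                                  confusable: list[str]) -> str:
    """Pair the target's description with confusable categories so a
    grounding detector must read the context rather than match the
    object name alone."""
    negatives = ". ".join(confusable)
    return f"{target}, which is {description} Other objects: {negatives}."


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for a real language model, used only for this demo.
        return ("a small, round citrus fruit with bright orange, "
                "dimpled skin.")

    query = build_context_sensitive_query(
        "orange",
        describe_object(fake_llm, "orange"),
        confusable=["lemon", "apple", "ball"],
    )
    print(query)  # this text query would then be fed to a detector such as GLIP or FIBER
```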
Pages: 16