DesCo: Learning Object Recognition with Rich Language Descriptions

Cited by: 0
Authors
Li, Liunian Harold [1]
Dou, Zi-Yi [1]
Peng, Nanyun [1]
Chang, Kai-Wei [1]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and thus improve the models' adaptability to novel objects and domains. Recent studies have attempted to query these models with complex language expressions that specify fine-grained details, such as colors, shapes, and relations. However, simply incorporating language descriptions into queries does not guarantee that the models interpret them accurately. In fact, our experiments show that GLIP, a state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle this challenge, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions, consisting of two innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and to force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr on minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
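The idea of a context-sensitive query can be illustrated with a minimal sketch. It is not the authors' implementation: the `Query` class, the `build_context_sensitive_queries` helper, and the hand-written descriptions are all hypothetical, and the sketch only assumes that a correct description is presented alongside confusable ones that share the same object name. In the abstract, the descriptions come from a large language model.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str          # language query that would be shown to the detector
    is_positive: bool  # True if the text correctly describes the target object


def build_context_sensitive_queries(name, positive_desc, negative_descs):
    """Pair one correct description with confusable ones sharing the object name.

    Hypothetical helper for illustration only; descriptions are supplied by the
    caller here, whereas the abstract describes generating them with an LLM.
    """
    queries = [Query(f"{name}, {positive_desc}", True)]
    queries += [Query(f"{name}, {desc}", False) for desc in negative_descs]
    return queries


# Placeholder descriptions written by hand for this example.
queries = build_context_sensitive_queries(
    name="mallet",
    positive_desc="a kind of tool, wooden handle with a round head",
    negative_descs=["a kind of tool, long metal blade with a sharp tip"],
)
for q in queries:
    print(q.is_positive, "|", q.text)
```

Because every query starts with the same object name, a model that matches on the name alone cannot tell the correct query from the confusable one; only the descriptive context separates them.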
Pages: 16