Towards Unsupervised Referring Expression Comprehension with Visual Semantic Parsing

Cited by: 1

Authors:

Wang, Yaodong [1]

Ji, Zhong [1]

Wang, Di [1]

Pang, Yanwei [1]

Li, Xuelong [2]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

[2] Northwestern Polytech Univ, Sch Artificial Intelligence, OPt & Elect iOPEN, Xian 710072, Peoples R China

Source:

KNOWLEDGE-BASED SYSTEMS | 2024 / Vol. 285

Funding:

National Natural Science Foundation of China;

Keywords:

referring expression comprehension; unsupervised learning; visual semantic parsing; RECONSTRUCTION;

DOI:

10.1016/j.knosys.2023.111318

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring Expression Comprehension (REC) is the task of grounding a specific object in an image, in the form of a bounding box, based on a given referring query. Existing fully-supervised or weakly-supervised REC methods rely on manually annotated region coordinates or query texts. In this paper, we propose an unsupervised training paradigm for the REC task that does not require any manually annotated data. Specifically, we introduce a Visual-Semantic-Parsing-based Unsupervised Referring Expression Comprehension framework (VUREC), which leverages a Visual Semantic Parser (VSP) as its core module to recognize the rich semantic information in images and construct pseudo-region-query pairs as the training supervision for REC. The VSP comprises a Scene Graph Parser (SGP) and a Visual Concept Detector (VCD), which together detect the locations, categories, and attributes of objects in images, as well as the visual relationships among them. Furthermore, we present a Referring Expression Reasoning (RER) model that utilizes a Multi-Modal Cascade Attention Decoder (MCAD) for fine-grained multi-modality fusion and directly regresses the four coordinates of the referential object. Experimental results on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate the effectiveness of the proposed method.
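
To make the idea of pseudo-region-query pairs concrete, the following is a minimal sketch of how detected objects, attributes, and relationships might be turned into (query, box) training pairs. It is an illustration only and not the authors' implementation: the `DetectedObject` class, the `describe` and `build_pseudo_pairs` functions, and the "attributes + category (+ predicate + other object)" query template are all assumptions, as is the example data.

```python
from dataclasses import dataclass, field


@dataclass
class DetectedObject:
    """Hypothetical parser output for one object (not the paper's data format)."""
    category: str
    box: tuple  # (x1, y1, x2, y2) in pixels
    attributes: list = field(default_factory=list)


def describe(obj):
    """Join attributes and category into a short noun phrase, e.g. 'brown horse'."""
    return " ".join([*obj.attributes, obj.category])


def build_pseudo_pairs(objects, relations):
    """Build (query, box) pairs from objects and (subject, predicate, object) triples.

    The template used here is an assumed one: each object yields a phrase of its
    own, and each relation yields a longer phrase grounded in the subject's box.
    """
    pairs = [(describe(obj), obj.box) for obj in objects]
    for subj_idx, predicate, obj_idx in relations:
        subj, obj = objects[subj_idx], objects[obj_idx]
        query = f"{describe(subj)} {predicate} {describe(obj)}"
        pairs.append((query, subj.box))
    return pairs


# Made-up example: two detected objects and one relationship between them.
objects = [
    DetectedObject("man", (10, 20, 110, 220), ["young"]),
    DetectedObject("horse", (90, 80, 300, 260), ["brown"]),
]
relations = [(0, "riding", 1)]
pairs = build_pseudo_pairs(objects, relations)
```

Under these assumptions, each pair could serve as one training sample for a model that regresses box coordinates from an image and a query, with no human annotation involved.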

Pages: 10
