Towards Unsupervised Referring Expression Comprehension with Visual Semantic Parsing

Cited by: 1

Authors:

Wang, Yaodong [1]

Ji, Zhong [1]

Wang, Di [1]

Pang, Yanwei [1]

Li, Xuelong [2]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

[2] Northwestern Polytech Univ, Sch Artificial Intelligence, OPt & Elect iOPEN, Xian 710072, Peoples R China

Source:

KNOWLEDGE-BASED SYSTEMS | 2024 / Vol. 285

Funding:

National Natural Science Foundation of China;

Keywords:

referring expression comprehension; unsupervised learning; visual semantic parsing; RECONSTRUCTION;

DOI:

10.1016/j.knosys.2023.111318

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring Expression Comprehension (REC) is the task of localizing, with a bounding box, the specific object in an image that a given referring query describes. Existing fully-supervised or weakly-supervised REC methods rely on either manually annotated region coordinates or query texts. In this paper, we propose an unsupervised training paradigm for the REC task that does not require any manually annotated data. Specifically, we introduce a Visual-Semantic-Parsing-based Unsupervised Referring Expression Comprehension framework (VUREC), which uses a Visual Semantic Parser (VSP) as its core module to recognize the rich semantic information in images and construct pseudo-region-query pairs as the training supervision for REC. The VSP comprises a Scene Graph Parser (SGP) and a Visual Concept Detector (VCD), which together detect the locations, categories, and attributes of objects in an image, as well as the visual relationships among them. Furthermore, we present a Referring Expression Reasoning (RER) model that uses a Multi-Modal Cascade Attention Decoder (MCAD) for fine-grained multi-modality fusion and directly regresses the four coordinates of the referred object. Experimental results on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate the effectiveness of the proposed method.
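
The idea of building pseudo-region-query pairs from parsed objects, attributes, and relationships can be illustrated with a minimal sketch. This is not the authors' implementation: the data structures `DetectedObject` and `Relation`, the function `build_pseudo_pairs`, and the simple template used to form query strings are all hypothetical and are introduced here only for illustration. The abstract does not specify how VUREC composes its queries.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x, y, width, height) layout


@dataclass
class DetectedObject:
    """Hypothetical parser output for one object."""
    box: Box
    category: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """Hypothetical relation: indices into the object list plus a predicate."""
    subject: int
    predicate: str
    obj: int


def describe(o: DetectedObject) -> str:
    # Attributes first, then the category, e.g. "brown dog".
    return " ".join(o.attributes + [o.category])


def build_pseudo_pairs(objects: List[DetectedObject],
                       relations: List[Relation]) -> List[Tuple[Box, str]]:
    """Return (box, query) pairs usable as pseudo supervision."""
    # One attribute-plus-category query per detected object.
    pairs = [(o.box, describe(o)) for o in objects]
    # One relational query per relation, grounded on the subject's box.
    for r in relations:
        subj, target = objects[r.subject], objects[r.obj]
        query = f"{describe(subj)} {r.predicate} {target.category}"
        pairs.append((subj.box, query))
    return pairs


# Toy input standing in for what a scene graph parser might return.
example_objects = [
    DetectedObject((10, 20, 50, 80), "dog", ["brown"]),
    DetectedObject((0, 60, 200, 40), "sofa"),
]
example_relations = [Relation(0, "on", 1)]
example_pairs = build_pseudo_pairs(example_objects, example_relations)
```

In this toy setting each pair ties a box to a text query, so a grounding model could be trained to regress the box from the query without any human annotation. A real pipeline would also need to filter ambiguous queries, for instance when two objects share the same category and attributes.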


Pages: 10

