Towards Unsupervised Referring Expression Comprehension with Visual Semantic Parsing

Cited by: 1
Authors
Wang, Yaodong [1]
Ji, Zhong [1]
Wang, Di [1]
Pang, Yanwei [1]
Li, Xuelong [2]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Northwestern Polytech Univ, Sch Artificial Intelligence, OPt & Elect iOPEN, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
referring expression comprehension; unsupervised learning; visual semantic parsing; reconstruction
DOI
10.1016/j.knosys.2023.111318
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Referring Expression Comprehension (REC) is the task of grounding, in the form of a bounding box, the specific object in an image that a given referring query describes. Existing fully supervised or weakly supervised REC methods rely on either manually annotated region coordinates or query texts. In this paper, we propose an unsupervised training paradigm for the REC task that does not require any manually annotated data. Specifically, we introduce a Visual-Semantic-Parsing-based Unsupervised Referring Expression Comprehension framework (VUREC), which leverages a Visual Semantic Parser (VSP) as its core module to recognize the rich semantic information in images and construct pseudo region-query pairs as the training supervision for REC. The VSP comprises a Scene Graph Parser (SGP) and a Visual Concept Detector (VCD), which together detect the locations, categories, and attributes of objects, as well as the visual relationships among them. Furthermore, we present a Referring Expression Reasoning (RER) model that uses a Multi-Modal Cascade Attention Decoder (MCAD) for fine-grained multi-modality fusion and directly regresses the four coordinates of the referred object. Experimental results on three benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method.
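As a rough illustration of what "constructing pseudo region-query pairs" could look like, the sketch below turns detected objects, attributes, and relationships into (query, box) training pairs. The abstract does not specify the templates or data structures VUREC uses, so everything here is an assumption made for illustration: the `DetectedObject` and `Relation` types, the `build_pseudo_pairs` function, the `(x1, y1, x2, y2)` box convention, and the simple "attributes + category" and "subject predicate object" phrase templates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class DetectedObject:
    """One detection: category, location, and optional attributes (assumed schema)."""
    category: str
    box: Box
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """A scene-graph edge; subject/object are indices into the object list."""
    subject: int
    predicate: str
    object: int


def noun_phrase(obj: DetectedObject) -> str:
    # Illustrative template: attributes first, then the category name.
    return " ".join([*obj.attributes, obj.category])


def build_pseudo_pairs(
    objects: List[DetectedObject], relations: List[Relation]
) -> List[Tuple[str, Box]]:
    """Return (pseudo query, target box) pairs built from parser output."""
    # One attribute-level query per detected object.
    pairs = [(noun_phrase(obj), obj.box) for obj in objects]
    # One relation-level query per edge, grounded on the subject's box.
    for rel in relations:
        subj, target = objects[rel.subject], objects[rel.object]
        query = f"{noun_phrase(subj)} {rel.predicate} {noun_phrase(target)}"
        pairs.append((query, subj.box))
    return pairs
```

With a brown dog detected on a sofa, this sketch would yield queries such as "brown dog", "sofa", and "brown dog on sofa", each paired with the box it refers to; a grounding model could then be trained on these pairs in place of human-written expressions.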
Pages: 10