UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet

Cited by: 0
|
Authors
Ye, Jiabo [1 ]
Tian, Junfeng [2 ]
Yan, Ming [2 ]
Xu, Haiyang [2 ]
Ye, Qinghao [2 ]
Shi, Yaya [3 ,4 ]
Yang, Xiaoshan [5 ]
Wang, Xuwu [6 ]
Zhang, Ji [2 ]
He, Liang [1 ]
Lin, Xin [1 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei, Peoples R China
[4] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[5] CASIA, NLPR, Beijing, Peoples R China
[6] Fudan Univ, Handan Campus, Shanghai, Peoples R China
Keywords
Referring expression comprehension; multi-modal understanding; multi-task learning; NAVIGATION;
DOI
10.1145/3660638
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language and has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones that extract features independently of the specific text input. However, we argue that the extracted visual features can be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which leverages the language query to guide visual feature extraction. However, QRNet focuses only on the grounding task, whose bounding-box annotations provide merely coarse-grained supervision; the guidance for the visual backbone is therefore indirect, and the inconsistency issue persists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet toward text-consistent visual features. UniQRNet also includes a loss balance strategy that allows the two types of supervision signals to cooperate in optimizing the model. We conduct the most comprehensive comparison to date, covering the four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8%-5.09%) and segmentation (0.57%-5.56%). Ablation studies and analysis reveal that UniQRNet improves the consistency of visual features with the text input and brings significant performance gains.
Pages: 28
Related Papers
50 records in total
  • [31] Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation
    Mi, Jinpeng
    Tang, Song
    Ma, Zhiyuan
    Liu, Dan
    Li, Qingdu
    Zhang, Jianwei
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023: 8299-8305
  • [32] INGRESS: Interactive visual grounding of referring expressions
    Shridhar, Mohit
    Mittal, Dixant
    Hsu, David
    INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH, 2020, 39 (2-3): 217-232
  • [33] RefCrowd: Grounding the Target in Crowd with Referring Expressions
    Qiu, Heqian
    Li, Hongliang
    Zhao, Taijin
    Wang, Lanxiao
    Wu, Qingbo
    Meng, Fanman
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4435-4444
  • [34] Grounding Referring Expressions in Images by Variational Context
    Zhang, Hanwang
    Niu, Yulei
    Chang, Shih-Fu
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 4158-4166
  • [35] SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
    Nag, Sayan
    Goswami, Koustava
    Karanam, Srikrishna
    COMPUTER VISION - ECCV 2024, PT XLIV, 2025, 15102: 485-503
  • [36] Expression Prompt Collaboration Transformer for universal referring video object segmentation
    Chen, Jiajun
    Lin, Jiacheng
    Zhong, Guojin
    Fu, Haolong
    Nai, Ke
    Yang, Kailun
    Li, Zhiyong
    KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [37] Cross-modality synergy network for referring expression comprehension and segmentation
    Li, Qianzhong
    Zhang, Yujia
    Sun, Shiying
    Wu, Jinting
    Zhao, Xiaoguang
    Tan, Min
    NEUROCOMPUTING, 2022, 467: 99-114
  • [38] Multiple Relational Learning Network for Joint Referring Expression Comprehension and Segmentation
    Hua, Guoguang
    Liao, Muxin
    Tian, Shishun
    Zhang, Yuhang
    Zou, Wenbin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 8805-8816
  • [39] Key-Word-Aware Network for Referring Expression Image Segmentation
    Shi, Hengcan
    Li, Hongliang
    Meng, Fanman
    Wu, Qingbo
    COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210: 38-54
  • [40] Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
    Chen, Long
    Ma, Wenbo
    Xiao, Jun
    Zhang, Hanwang
    Chang, Shih-Fu
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35: 1036-1044