UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet

Cited by: 0
Authors
Ye, Jiabo [1 ]
Tian, Junfeng [2 ]
Yan, Ming [2 ]
Xu, Haiyang [2 ]
Ye, Qinghao [2 ]
Shi, Yaya [3 ,4 ]
Yang, Xiaoshan [5 ]
Wang, Xuwu [6 ]
Zhang, Ji [2 ]
He, Liang [1 ]
Lin, Xin [1 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei, Peoples R China
[4] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[5] CASIA, NLPR, Beijing, Peoples R China
[6] Fudan Univ, Handan Campus, Shanghai, Peoples R China
Keywords
Referring expression comprehension; multi-modal understanding; multi-task learning; navigation
DOI
10.1145/3660638
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language and has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones that extract features independently of the specific text input. However, we argue that the extracted visual features can be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which leverages the language query to guide visual feature extraction. However, QRNet focuses only on the grounding task, which provides coarse-grained annotations in the form of bounding-box coordinates; the guidance for the visual backbone is therefore indirect, and the inconsistency issue persists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet toward text-consistent visual features. In addition, UniQRNet includes a loss balance strategy that allows the two types of supervision signals to cooperate in optimizing the model. We conduct the most comprehensive comparison to date, covering the four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8%~5.09%) and segmentation (0.57%~5.56%). Ablations and analysis show that UniQRNet improves the consistency of visual features with the text input and brings significant performance gains.
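To make the architecture described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of its two ideas: a query-modulated refinement block that conditions intermediate visual features on the text query, and a multi-task head whose box-level grounding loss and pixel-level segmentation loss jointly supervise the shared features. All module names (QueryModulatedBlock, MultiTaskHead), tensor shapes, the FiLM-style modulation, and the fixed loss weights are illustrative assumptions based only on the abstract, not the paper's actual implementation.

# Hypothetical sketch of the UniQRNet ideas from the abstract; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryModulatedBlock(nn.Module):
    """Refine visual features with the text query (assumed FiLM-style modulation)."""
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.gamma = nn.Linear(txt_dim, vis_dim)  # channel-wise scale from the query
        self.beta = nn.Linear(txt_dim, vis_dim)   # channel-wise shift from the query

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) feature map; txt: (B, D) pooled query embedding
        g = self.gamma(txt).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(txt).unsqueeze(-1).unsqueeze(-1)
        return vis * (1 + g) + b


class MultiTaskHead(nn.Module):
    """Shared features feed both a box branch (grounding) and a mask branch (segmentation)."""
    def __init__(self, vis_dim: int):
        super().__init__()
        self.box_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(vis_dim, 4))    # (cx, cy, w, h)
        self.mask_head = nn.Conv2d(vis_dim, 1, kernel_size=1)   # per-pixel logits

    def forward(self, feat: torch.Tensor):
        return self.box_head(feat).sigmoid(), self.mask_head(feat)


def joint_loss(pred_box, pred_mask, gt_box, gt_mask, w_ground=1.0, w_seg=1.0):
    # w_ground / w_seg stand in for the paper's loss balance strategy,
    # assumed here to be a simple weighted sum.
    l_ground = F.l1_loss(pred_box, gt_box)
    l_seg = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)
    return w_ground * l_ground + w_seg * l_seg


if __name__ == "__main__":
    B, C, D, H, W = 2, 256, 512, 20, 20
    block, head = QueryModulatedBlock(C, D), MultiTaskHead(C)
    vis, txt = torch.randn(B, C, H, W), torch.randn(B, D)
    feat = block(vis, txt)            # text-conditioned visual features
    box, mask = head(feat)
    loss = joint_loss(box, mask, torch.rand(B, 4), torch.rand(B, 1, H, W).round())
    loss.backward()                   # both supervision signals reach the backbone
    print(loss.item())

Because both losses backpropagate through the same refined features, the pixel-level segmentation signal reaches the intermediate layers directly, which is the mechanism the abstract credits for text-consistent visual features; the paper's actual loss balance strategy is likely more sophisticated than the fixed weights used here.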
Pages: 28