UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet

Cited by: 0
Authors
Ye, Jiabo [1 ]
Tian, Junfeng [2 ]
Yan, Ming [2 ]
Xu, Haiyang [2 ]
Ye, Qinghao [2 ]
Shi, Yaya [3 ,4 ]
Yang, Xiaoshan [5 ]
Wang, Xuwu [6 ]
Zhang, Ji [2 ]
He, Liang [1 ]
Lin, Xin [1 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei, Peoples R China
[4] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[5] CASIA, NLPR, Beijing, Peoples R China
[6] Fudan Univ, Handan Campus, Shanghai, Peoples R China
Keywords
Referring expression comprehension; multi-modal understanding; multi-task learning; navigation
DOI
10.1145/3660638
CLC Classification
TP [Automation technology, computer technology]
Subject Classification
0812
Abstract
Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language. This has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones that extract features independently of the specific text input. However, we argue that the extracted visual features can be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which leverages the language query to guide visual feature extraction. However, QRNet addresses only the grounding task, which provides only coarse-grained annotations in the form of bounding-box coordinates; the guidance for the visual backbone is therefore indirect, and the inconsistency issue persists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet toward text-consistent visual features. In addition, UniQRNet includes a loss balance strategy that allows the two types of supervision signals to cooperate in optimizing the model. We conduct the most comprehensive comparison to date, covering the four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8% to 5.09%) and segmentation (0.57% to 5.56%). Ablations and analysis show that UniQRNet improves the consistency of visual features with the text input and brings significant performance gains.
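The abstract names two mechanisms: query-modulated refinement of backbone features and a balance between box-level and pixel-level losses. The sketch below is a minimal PyTorch illustration of both ideas, not the authors' implementation; the names (QueryModulatedBlock, BalancedMultiTaskLoss), the FiLM-style channel modulation, and the uncertainty-style weighting are all assumptions standing in for the mechanisms the abstract describes.

    # Minimal sketch (assumed, not the paper's code) of query-modulated
    # feature refinement and a learnable two-task loss balance.
    import torch
    import torch.nn as nn

    class QueryModulatedBlock(nn.Module):
        """Refines a visual feature map under guidance from the text query."""
        def __init__(self, vis_dim: int, txt_dim: int):
            super().__init__()
            # Project the pooled text embedding to per-channel scale and shift.
            self.to_scale = nn.Linear(txt_dim, vis_dim)
            self.to_shift = nn.Linear(txt_dim, vis_dim)

        def forward(self, vis_feat: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
            # vis_feat: (B, C, H, W); txt_emb: (B, txt_dim)
            scale = self.to_scale(txt_emb)[:, :, None, None]
            shift = self.to_shift(txt_emb)[:, :, None, None]
            # Channel-wise modulation conditions intermediate features on the
            # query, so the backbone is no longer text-agnostic.
            return vis_feat * torch.sigmoid(scale) + shift

    class BalancedMultiTaskLoss(nn.Module):
        """Learnable weighting of box and mask losses (homoscedastic-uncertainty
        style); the paper's actual balance strategy may differ."""
        def __init__(self):
            super().__init__()
            self.log_var_box = nn.Parameter(torch.zeros(()))
            self.log_var_mask = nn.Parameter(torch.zeros(()))

        def forward(self, loss_box: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
            return (torch.exp(-self.log_var_box) * loss_box + self.log_var_box
                    + torch.exp(-self.log_var_mask) * loss_mask + self.log_var_mask)

    # Usage: refine a backbone feature map, then combine the two task losses.
    block = QueryModulatedBlock(vis_dim=256, txt_dim=768)
    feat = block(torch.randn(2, 256, 32, 32), torch.randn(2, 768))
    loss = BalancedMultiTaskLoss()(torch.tensor(1.2), torch.tensor(0.8))

Under this weighting, a task whose learned log-variance grows contributes less to the total loss but pays a regularization penalty, which is one common way to let two supervision signals cooperate rather than compete.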
Pages: 28
Related Papers
50 records in total
  • [41] Fully and Weakly Supervised Referring Expression Segmentation With End-to-End Learning
    Li, Hui
    Sun, Mingjie
    Xiao, Jimin
    Lim, Eng Gee
    Zhao, Yao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5999 - 6012
  • [42] PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues
    Islam, Md Mofijul
    Gladstone, Alexi
    Iqbal, Tariq
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023 : 971 - 979
  • [43] Relationship-Embedded Representation Learning for Grounding Referring Expressions
    Yang, Sibei
    Li, Guanbin
    Yu, Yizhou
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (08) : 2765 - 2779
  • [44] Hierarchical collaboration for referring image segmentation
    Zhang, Wei
    Cheng, Zesen
    Chen, Jie
    Gao, Wen
    NEUROCOMPUTING, 2025, 613
  • [45] A Robust Algorithm: Find an Unknown Person via Referring Grounding
    Wang, Xiping
    Wu, Feng
    Lu, Dongcai
    Chen, Xiaoping
    ROBOCUP 2017: ROBOT WORLD CUP XXI, 2018, 11175 : 228 - 240
  • [46] Cross-Modal Relationship Inference for Grounding Referring Expressions
    Yang, Sibei
    Li, Guanbin
    Yu, Yizhou
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 4140 - 4149
  • [47] Toward Robust Referring Image Segmentation
    Wu, Jianzong
    Li, Xiangtai
    Li, Xia
    Ding, Henghui
    Tong, Yunhai
    Tao, Dacheng
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1782 - 1794
  • [48] Methods for Referring Video Object Segmentation
    Wei, Caiying
    Jia, Lei
COMPUTER ENGINEERING AND APPLICATIONS, 61 (02) : 73 - 83
  • [49] Video Object Segmentation with Referring Expressions
    Khoreva, Anna
    Rohrbach, Anna
    Schiele, Bernt
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 7 - 12