UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet

Cited: 0
Authors
Ye, Jiabo [1 ]
Tian, Junfeng [2 ]
Yan, Ming [2 ]
Xu, Haiyang [2 ]
Ye, Qinghao [2 ]
Shi, Yaya [3 ,4 ]
Yang, Xiaoshan [5 ]
Wang, Xuwu [6 ]
Zhang, Ji [2 ]
He, Liang [1 ]
Lin, Xin [1 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei, Peoples R China
[4] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[5] CASIA, NLPR, Beijing, Peoples R China
[6] Fudan Univ, Handan Campus, Shanghai, Peoples R China
Keywords
Referring expression comprehension; multi-modal understanding; multi-task learning; navigation
DOI
10.1145/3660638
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language and has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones that extract features independently of the specific text input. However, we argue that the extracted visual features can be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which uses the language query to guide visual feature extraction. However, QRNet focuses only on the grounding task, which provides merely coarse-grained supervision in the form of bounding-box coordinates; the guidance for the visual backbone is therefore indirect, and the inconsistency issue persists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained, pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet toward text-consistent visual features. In addition, UniQRNet includes a loss balance strategy that allows the two types of supervision signals to cooperate in optimizing the model. We conduct the most comprehensive comparison to date, covering the four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8%–5.09%) and segmentation (0.57%–5.56%). Ablation and analysis reveal that UniQRNet improves the consistency of visual features with the text input, which brings significant performance gains.
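To make the abstract's multi-task idea concrete, here is a minimal PyTorch-style sketch of a shared head that produces both a bounding box (grounding) and a pixel-level mask (segmentation), combined with a weighted joint loss. This is not the paper's actual code: all module names, feature shapes, and the fixed alpha/beta weights are illustrative assumptions, and the fixed weights merely stand in for the paper's loss balance strategy, whose exact scheme is not specified in this record.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Toy multi-task head (hypothetical, not from the paper): predicts a
    bounding box and a low-resolution segmentation mask from shared
    vision-language fused features."""
    def __init__(self, dim=256, mask_size=28):
        super().__init__()
        # Grounding branch: normalized (cx, cy, w, h) box coordinates.
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4)
        )
        # Segmentation branch: per-pixel logits on a mask_size x mask_size grid.
        self.mask_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, mask_size * mask_size)
        )
        self.mask_size = mask_size

    def forward(self, fused):                  # fused: (B, dim)
        box = self.box_head(fused).sigmoid()   # coordinates in [0, 1]
        mask_logits = self.mask_head(fused).view(-1, self.mask_size, self.mask_size)
        return box, mask_logits

def joint_loss(box_pred, mask_logits, box_gt, mask_gt, alpha=1.0, beta=1.0):
    """Combine coarse box supervision with fine-grained pixel supervision.
    alpha/beta are a stand-in for the paper's loss balance strategy."""
    l_box = F.l1_loss(box_pred, box_gt)
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    return alpha * l_box + beta * l_mask

# Smoke test with random tensors.
head = MultiTaskHead()
fused = torch.randn(2, 256)
box, mask_logits = head(fused)
loss = joint_loss(box, mask_logits,
                  torch.rand(2, 4),                    # dummy box targets
                  torch.rand(2, 28, 28).round())       # dummy binary masks
loss.backward()
```

The point of the sketch is that both branches backpropagate through the same fused features, so pixel-level mask supervision reaches the shared (and, in the paper, query-modulated) backbone directly, which is the mechanism the abstract credits for text-consistent visual features.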
Pages: 28