Prompt-guided bidirectional deep fusion network for referring image segmentation

Cited by: 0
Authors
Wu, Junxian [1, 2]
Zhang, Yujia [1 ]
Kampffmeyer, Michael [3 ]
Zhao, Xiaoguang [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] UiT Arctic Univ Norway, Dept Phys & Technol, Tromso, Norway
Funding
National Natural Science Foundation of China;
Keywords
Referring image segmentation; Prompt-guided bidirectional encoder fusion; Prompt-guided cross-modal interaction;
DOI
10.1016/j.neucom.2024.128899
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Referring image segmentation involves accurately segmenting objects based on natural language descriptions. This poses challenges due to the intricate and varied nature of language expressions, as well as the requirement to identify relevant image regions among multiple objects. Current models predominantly employ language-aware early fusion techniques, which may lead to misinterpretations of language expressions due to the lack of explicit visual guidance of the language encoder. Additionally, early fusion methods are unable to adequately leverage high-level contexts. To address these limitations, this paper introduces the Prompt-guided Bidirectional Deep Fusion Network (PBDF-Net) to enhance the fusion of language and vision modalities. In contrast to traditional unidirectional early fusion approaches, our approach employs a prompt-guided bidirectional encoder fusion (PBEF) module to promote mutual cross-modal fusion across multiple stages of the vision and language encoders. Furthermore, PBDF-Net incorporates a prompt-guided cross-modal interaction (PCI) module during the late fusion stage, facilitating a more profound integration of contextual information from both modalities, resulting in more accurate target segmentation. Comprehensive experiments conducted on the RefCOCO, RefCOCO+, G-Ref and ReferIt datasets substantiate the efficacy of our proposed method, demonstrating significant advancements in performance compared to existing approaches.
Pages: 12
Related papers
50 in total
  • [1] Multiscale deep feature selection fusion network for referring image segmentation
    Xianwen Dai
    Jiacheng Lin
    Ke Nai
    Qingpeng Li
    Zhiyong Li
    Multimedia Tools and Applications, 2024, 83 : 36287 - 36305
  • [2] Multiscale deep feature selection fusion network for referring image segmentation
    Dai, Xianwen
    Lin, Jiacheng
    Nai, Ke
    Li, Qingpeng
    Li, Zhiyong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (12) : 36287 - 36305
  • [3] Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation
    Feng, Guang
    Hu, Zhiwei
    Zhang, Lihe
    Sun, Jiayu
    Lu, Huchuan
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (05) : 2246 - 2258
  • [4] Structured Multimodal Fusion Network for Referring Image Segmentation
    Xue, Mingcheng
    Liu, Yu
    Xu, Kaiping
    Zhang, Haiyang
    Yu, Chengyang
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 36 - 47
  • [5] DCMFNet: Deep Cross-Modal Fusion Network for Referring Image Segmentation with Iterative Gated Fusion
    Huang, Zhen
    Xue, Mingcheng
    Liu, Yu
    Xu, Kaiping
    Li, Jiangquan
    Yu, Chenyang
    PROCEEDINGS OF THE 50TH GRAPHICS INTERFACE CONFERENCE, GI 2024, 2024,
  • [6] Prompt-Guided Sparse Transformer for Remote Sensing Image Dehazing
    Dong, Haobo
    Song, Tianyu
    Qi, Xuanyu
    Jin, Guiyue
    Jin, Jiyu
    Ma, Ling
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21
  • [7] PROMPTCAP: Prompt-Guided Image Captioning for VQA with GPT-3
    Hu, Yushi
    Hua, Hang
    Yang, Zhengyuan
    Shi, Weijia
    Smith, Noah A.
    Luo, Jiebo
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2951 - 2963
  • [8] MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition
    Liu, Wei
    Ren, Aiqun
    Wang, Chao
    Peng, Yan
    Xie, Shaorong
    Li, Weimin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (28) : 71639 - 71663
  • [9] Low-Rank Prompt-Guided Transformer for Hyperspectral Image Denoising
    Tan, Xiaodong
    Shao, Mingwen
    Qiao, Yuanjian
    Liu, Tiyao
    Cao, Xiangyong
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [10] Prompt-Guided Semantic-Aware Distillation for Weakly Supervised Incremental Semantic Segmentation
    Hao, Xuze
    Jiang, Xuhao
    Ni, Wenqian
    Tan, Weimin
    Yan, Bo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 10632 - 10645