Hierarchical cross-modal contextual attention network for visual grounding

Cited by: 0
|
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University,School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center,Institute of Artificial Intelligence
[3] Anhui Jianzhu University,Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China
[5] Chinese Academy of Sciences,School of Information Science and Technology
Source
Multimedia Systems | 2023 / Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning;
DOI
Not available
Abstract
This paper explores the task of visual grounding (VG), which aims to localize regions of an image from sentence queries. VG has advanced significantly with Transformer-based frameworks, which capture image and text contexts without proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. The HCCAN model comprises a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This approach not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also captures the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
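The paper's modules are not reproduced here, but the core operation that "visual-guided text" and "text-guided visual" attention modules of this kind build on is scaled dot-product cross-attention, where one modality supplies the queries and the other supplies keys and values. A minimal NumPy sketch follows; all names, shapes, and the single-head formulation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention across modalities.

    queries: (n_q, d) features of the guiding modality (e.g. text tokens)
    keys, values: (n_k, d) features of the attended modality (e.g. visual regions)
    Returns (n_q, d): each query row is a weighted summary of `values`.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) pairwise affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # convex combination of values

# Toy example: 4 text tokens attend over 9 visual region features, dim 16.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 16))
visual = rng.standard_normal((9, 16))
attended = cross_attention(text, visual, visual)
print(attended.shape)  # (4, 16)
```

Swapping the roles of `text` and `visual` gives the symmetric direction; in practice each direction would use learned query/key/value projections and multiple heads, which are omitted here for brevity.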
Pages: 2073–2083
Page count: 10
Related papers
50 records in total
  • [21] Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval
    Ji, Zhong
    Lin, Zhigang
    Wang, Haoran
    Pang, Yanwei
    Li, Xuelong
    PATTERN RECOGNITION, 2024, 151
  • [22] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [23] Adversarial Graph Attention Network for Multi-modal Cross-modal Retrieval
    Wu, Hongchang
    Guan, Ziyu
    Zhi, Tao
    Zhao, Wei
    Xu, Cai
    Han, Hong
    Yang, Yanning
    2019 10TH IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (ICBK 2019), 2019, : 265 - 272
  • [24] BCAN: Bidirectional Correct Attention Network for Cross-Modal Retrieval
    Liu, Yang
    Liu, Hong
    Wang, Huaqiu
    Meng, Fanyang
    Liu, Mengyuan
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (10) : 14247 - 14258
  • [25] Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer
    Ilinykh, Nikolai
    Dobnik, Simon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 4062 - 4073
  • [26] Iterative graph attention memory network for cross-modal retrieval
    Dong, Xinfeng
    Zhang, Huaxiang
    Dong, Xiao
    Lu, Xu
    KNOWLEDGE-BASED SYSTEMS, 2021, 226
  • [27] Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval
    Yu, Tan
    Yang, Yi
    Li, Yi
    Liu, Lin
    Fei, Hongliang
    Li, Ping
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1146 - 1156
  • [28] Visual question answering with attention transfer and a cross-modal gating mechanism
    Li, Wei
    Sun, Jianhui
    Liu, Ge
    Zhao, Linglan
    Fang, Xiangzhong
    PATTERN RECOGNITION LETTERS, 2020, 133 : 334 - 340
  • [29] Cross-modal attention guided visual reasoning for referring image segmentation
    Zhang, Wenjing
    Hu, Mengnan
    Tan, Quange
    Zhou, Qianli
    Wang, Rong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (19) : 28853 - 28872