Hierarchical cross-modal contextual attention network for visual grounding

Cited by: 0
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University, School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center, Institute of Artificial Intelligence
[3] Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China, School of Information Science and Technology
[5] Chinese Academy of Sciences
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning
DOI
Not available
Abstract
This paper explores the task of visual grounding (VG), which aims to localize the image region described by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN combines a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. The approach not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
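The abstract describes the architecture only at a high level. As a rough illustration of the mechanism it names, the following is a minimal PyTorch sketch of two cross-guided attention modules feeding a Transformer fusion stage. All class and parameter names, the dimensions, and the box-regression head are assumptions made for exposition, not the authors' released implementation (see the repository linked above for that).

import torch
import torch.nn as nn

class CrossModalContextualAttention(nn.Module):
    # Sketch of cross-guided attention for visual grounding
    # (hypothetical names, not the HCCAN reference code).
    def __init__(self, d_model=256, n_heads=8, n_fusion_layers=6):
        super().__init__()
        # Intra-modality self-attention: relationships within each modality.
        self.visual_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-modality cross-attention: each modality queries the other.
        self.text_guided_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_guided_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer-based fusion over the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_fusion_layers)
        # Head regressing one normalized box (cx, cy, w, h) for the query.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, d) patch features; txt_tokens: (B, Nt, d) word features.
        v, _ = self.visual_self_attn(vis_tokens, vis_tokens, vis_tokens)
        t, _ = self.text_self_attn(txt_tokens, txt_tokens, txt_tokens)
        # Text-guided visual attention (patch queries attend to words) and
        # visual-guided text attention (word queries attend to patches).
        v_ctx, _ = self.text_guided_visual(v, t, t)
        t_ctx, _ = self.visual_guided_text(t, v, v)
        fused = self.fusion(torch.cat([v_ctx, t_ctx], dim=1))
        # Mean-pool the fused sequence and predict the grounding box.
        return self.box_head(fused.mean(dim=1))

# Random features stand in for backbone outputs: 196 patches, 20 words, d=256.
model = CrossModalContextualAttention()
box = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
print(box.shape)  # torch.Size([2, 4])

The point mirrored from the abstract is the ordering: each uni-modal stream first builds intra-modality context with self-attention and then queries the other stream for inter-modality context before joint fusion. The hierarchical-semantics component of HCCAN is not modeled in this sketch.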
Pages: 2073–2083
Page count: 10
Related papers
50 items in total
  • [1] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [2] Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers
    Zhang, Qianjun
    Yuan, Jin
    APPLIED SCIENCES-BASEL, 2023, 13 (09)
  • [3] CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos
    Wang, Wen
    Zhong, Ling
    Gao, Guang
    Wan, Minhong
    Gu, Jason
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1499 - 1504
  • [4] Cross-modal orienting of visual attention
    Hillyard, Steven A.
    Stoermer, Viola S.
    Feng, Wenfeng
    Martinez, Antigona
    McDonald, John J.
    NEUROPSYCHOLOGIA, 2016, 83 : 170 - 178
  • [5] Learning Cross-Modal Context Graph for Visual Grounding
    Liu, Yongfei
    Wan, Bo
    Zhu, Xiaodan
    He, Xuming
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-20), 2020, 34 : 11645 - 11652
  • [6] Cross-modal contextual memory guides selective attention in visual-search tasks
    Chen, Siyi
    Shi, Zhuanghua
    Zinchenko, Artyom
    Mueller, Hermann J.
    Geyer, Thomas
    PSYCHOPHYSIOLOGY, 2022, 59 (07)
  • [7] Cross-modal exogenous visual selective attention
    Zhao, C
    Yang, H
    Zhang, K
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2000, 35 (3-4) : 100 - 100
  • [8] Utilizing visual attention for cross-modal coreference interpretation
    Byron, D
    Mampilly, T
    Sharma, V
    Xu, TF
    MODELING AND USING CONTEXT, PROCEEDINGS, 2005, 3554 : 83 - 96
  • [9] Bridging Modality Gap for Visual Grounding with Effective Cross-Modal Distillation
    Wang, Jiaxi
    Hu, Wenhui
    Liu, Xueyang
    Wu, Beihu
    Qiu, Yuting
    Cai, YingYing
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 347 - 363
  • [10] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 31516 - 31524