Hierarchical cross-modal contextual attention network for visual grounding

Cited: 0
|
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University, School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center, Institute of Artificial Intelligence
[3] Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China, School of Information Science and Technology
[5] Chinese Academy of Sciences
Source
Multimedia Systems | 2023 / Volume 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning
DOI
Not available
Abstract
This paper explores the task of visual grounding (VG), which aims to localize the image region referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without relying on region proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN consists of a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
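The following is a minimal PyTorch sketch of the attention pattern the abstract describes, not the authors' released implementation (see the repository linked above): one guided-attention block per direction (text attending to visual context and vice versa), followed by a Transformer encoder that fuses the two token streams and regresses a bounding box. All module names, feature dimensions, head counts, and the box-prediction head are illustrative assumptions.

# Minimal sketch of a cross-modal contextual attention pipeline (assumed layout,
# not the authors' code). Each modality is refined by cross-attention to the other,
# then self-attention, before Transformer-based multi-modal fusion.
import torch
import torch.nn as nn


class CrossModalContextualAttention(nn.Module):
    """One guided-attention direction: queries from one modality,
    keys/values from the other, followed by self-attention refinement."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # Inter-modality relationships: tokens attend to the other modality.
        attended, _ = self.cross_attn(query_feats, context_feats, context_feats)
        x = self.norm1(query_feats + attended)
        # Intra-modality relationships: self-attention over the updated tokens.
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)


class HCCANSketch(nn.Module):
    """Toy pipeline: visual-guided text attention, text-guided visual attention,
    then a Transformer encoder over the concatenated tokens and a box head."""

    def __init__(self, dim: int = 256, heads: int = 8, fusion_layers: int = 2):
        super().__init__()
        self.text_branch = CrossModalContextualAttention(dim, heads)    # visual-guided
        self.visual_branch = CrossModalContextualAttention(dim, heads)  # text-guided
        fusion_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=fusion_layers)
        self.box_head = nn.Linear(dim, 4)  # predict (cx, cy, w, h) of the grounded region

    def forward(self, visual_tokens, text_tokens):
        t = self.text_branch(text_tokens, visual_tokens)    # text refined by image context
        v = self.visual_branch(visual_tokens, text_tokens)  # image refined by text context
        fused = self.fusion(torch.cat([v, t], dim=1))       # joint multi-modal reasoning
        return self.box_head(fused.mean(dim=1)).sigmoid()   # normalized box coordinates


if __name__ == "__main__":
    vis = torch.randn(2, 196, 256)        # e.g. flattened image patch features
    txt = torch.randn(2, 20, 256)         # e.g. encoded query-word features
    print(HCCANSketch()(vis, txt).shape)  # torch.Size([2, 4])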
Pages: 2073 - 2083
Number of pages: 10
Related papers
50 records in total
  • [41] CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network
    Peng, Yuxin
    Qi, Jinwei
    Huang, Xin
    Yuan, Yuxin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (02) : 405 - 420
  • [42] Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification
    Peng, Cheng
    Zhang, Chunxia
    Xue, Xiaojun
    Gao, Jiameng
    Liang, Hongjian
    Niu, Zhengdong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2022, 27 (04) : 664 - 679
  • [43] Cross-Modal Relationship Inference for Grounding Referring Expressions
    Yang, Sibei
    Li, Guanbin
    Yu, Yizhou
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 4140 - 4149
  • [45] Cross-Modal Omni Interaction Modeling for Phrase Grounding
    Yu, Tianyu
    Hui, Tianrui
    Yu, Zhihao
    Liao, Yue
    Yu, Sansi
    Zhang, Faxi
    Liu, Si
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 1725 - 1734
  • [46] Hierarchical Multi-modal Contextual Attention Network for Fake News Detection
    Qian, Shengsheng
    Wang, Jinguang
    Hu, Jun
    Fang, Quan
    Xu, Changsheng
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 153 - 162
  • [47] Dual-supervised attention network for deep cross-modal hashing
    Peng, Hanyu
    He, Junjun
    Chen, Shifeng
    Wang, Yali
    Qiao, Yu
    PATTERN RECOGNITION LETTERS, 2019, 128 : 333 - 339
  • [48] Cross-Modal Self-Attention Network for Referring Image Segmentation
    Ye, Linwei
    Rochan, Mrigank
    Liu, Zhi
    Wang, Yang
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10494 - 10503
  • [49] Cross-modal Attention Network with Orthogonal Latent Memory for Rumor Detection
    Wu, Zekai
    Chen, Jiaxin
    Yang, Zhenguo
    Xie, Haoran
    Wang, Fu Lee
    Liu, Wenyin
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2021, PT I, 2021, 13080 : 527 - 541
  • [50] CMAAN: Cross-Modal Aggregation Attention Network for Next POI Recommendation
    Zhuang, Zhuang
    Liu, Lingbo
    Qi, Heng
    Shen, Yanming
    Yin, Baocai
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024