Hierarchical cross-modal contextual attention network for visual grounding

Cited: 0
|
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University, School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center, Institute of Artificial Intelligence
[3] Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China, School of Information Science and Technology
[5] Chinese Academy of Sciences
Source
Multimedia Systems | 2023 / Volume 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning
DOI
Not available
Abstract
This paper explores the task of visual grounding (VG), which aims to localize the image region referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without relying on region proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN consists of a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
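The following is a minimal PyTorch sketch of the attention pattern the abstract describes, not the authors' released implementation (see the repository linked above): one guided-attention block per direction (text attending to visual context and vice versa), followed by a Transformer encoder that fuses the two token streams and regresses a bounding box. All module names, feature dimensions, head counts, and the box-prediction head are illustrative assumptions.

# Minimal sketch of a cross-modal contextual attention pipeline (assumed layout,
# not the authors' code). Each modality is refined by cross-attention to the other,
# then self-attention, before Transformer-based multi-modal fusion.
import torch
import torch.nn as nn


class CrossModalContextualAttention(nn.Module):
    """One guided-attention direction: queries from one modality,
    keys/values from the other, followed by self-attention refinement."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # Inter-modality relationships: tokens attend to the other modality.
        attended, _ = self.cross_attn(query_feats, context_feats, context_feats)
        x = self.norm1(query_feats + attended)
        # Intra-modality relationships: self-attention over the updated tokens.
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)


class HCCANSketch(nn.Module):
    """Toy pipeline: visual-guided text attention, text-guided visual attention,
    then a Transformer encoder over the concatenated tokens and a box head."""

    def __init__(self, dim: int = 256, heads: int = 8, fusion_layers: int = 2):
        super().__init__()
        self.text_branch = CrossModalContextualAttention(dim, heads)    # visual-guided
        self.visual_branch = CrossModalContextualAttention(dim, heads)  # text-guided
        fusion_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=fusion_layers)
        self.box_head = nn.Linear(dim, 4)  # predict (cx, cy, w, h) of the grounded region

    def forward(self, visual_tokens, text_tokens):
        t = self.text_branch(text_tokens, visual_tokens)    # text refined by image context
        v = self.visual_branch(visual_tokens, text_tokens)  # image refined by text context
        fused = self.fusion(torch.cat([v, t], dim=1))       # joint multi-modal reasoning
        return self.box_head(fused.mean(dim=1)).sigmoid()   # normalized box coordinates


if __name__ == "__main__":
    vis = torch.randn(2, 196, 256)        # e.g. flattened image patch features
    txt = torch.randn(2, 20, 256)         # e.g. encoded query-word features
    print(HCCANSketch()(vis, txt).shape)  # torch.Size([2, 4])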
Pages: 2073 - 2083
Number of pages: 10
Related papers
50 records in total
  • [41] CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network
    Peng, Yuxin
    Qi, Jinwei
    Huang, Xin
    Yuan, Yuxin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (02) : 405 - 420
  • [42] Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification
    Peng, Cheng
    Zhang, Chunxia
    Xue, Xiaojun
    Gao, Jiameng
    Liang, Hongjian
    Niu, Zhengdong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2022, 27 (04) : 664 - 679
  • [43] Cross-Modal Relationship Inference for Grounding Referring Expressions
    Yang, Sibei
    Li, Guanbin
    Yu, Yizhou
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 4140 - 4149
  • [45] Cross-Modal Omni Interaction Modeling for Phrase Grounding
    Yu, Tianyu
    Hui, Tianrui
    Yu, Zhihao
    Liao, Yue
    Yu, Sansi
    Zhang, Faxi
    Liu, Si
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 1725 - 1734
  • [46] Hierarchical Multi-modal Contextual Attention Network for Fake News Detection
    Qian, Shengsheng
    Wang, Jinguang
    Hu, Jun
    Fang, Quan
    Xu, Changsheng
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 153 - 162
  • [47] Dual-supervised attention network for deep cross-modal hashing
    Peng, Hanyu
    He, Junjun
    Chen, Shifeng
    Wang, Yali
    Qiao, Yu
    PATTERN RECOGNITION LETTERS, 2019, 128 : 333 - 339
  • [48] Cross-Modal Self-Attention Network for Referring Image Segmentation
    Ye, Linwei
    Rochan, Mrigank
    Liu, Zhi
    Wang, Yang
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10494 - 10503
  • [49] Cross-modal Attention Network with Orthogonal Latent Memory for Rumor Detection
    Wu, Zekai
    Chen, Jiaxin
    Yang, Zhenguo
    Xie, Haoran
    Wang, Fu Lee
    Liu, Wenyin
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2021, PT I, 2021, 13080 : 527 - 541
  • [50] CMAAN: Cross-Modal Aggregation Attention Network for Next POI Recommendation
    Zhuang, Zhuang
    Liu, Lingbo
    Qi, Heng
    Shen, Yanming
    Yin, Baocai
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024