Hierarchical cross-modal contextual attention network for visual grounding

Cited by: 0
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University, School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center, Institute of Artificial Intelligence
[3] Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China, School of Information Science and Technology
[5] Chinese Academy of Sciences
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning
DOI
Not available
Abstract
This paper explores the task of visual grounding (VG), which aims to localize the image region described by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN combines a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. The approach not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
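The abstract describes the architecture only at a high level. As a rough illustration of the mechanism it names, the following is a minimal PyTorch sketch of two cross-guided attention modules feeding a Transformer fusion stage. All class and parameter names, the dimensions, and the box-regression head are assumptions made for exposition, not the authors' released implementation (see the repository linked above for that).

import torch
import torch.nn as nn

class CrossModalContextualAttention(nn.Module):
    # Sketch of cross-guided attention for visual grounding
    # (hypothetical names, not the HCCAN reference code).
    def __init__(self, d_model=256, n_heads=8, n_fusion_layers=6):
        super().__init__()
        # Intra-modality self-attention: relationships within each modality.
        self.visual_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-modality cross-attention: each modality queries the other.
        self.text_guided_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_guided_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer-based fusion over the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_fusion_layers)
        # Head regressing one normalized box (cx, cy, w, h) for the query.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, d) patch features; txt_tokens: (B, Nt, d) word features.
        v, _ = self.visual_self_attn(vis_tokens, vis_tokens, vis_tokens)
        t, _ = self.text_self_attn(txt_tokens, txt_tokens, txt_tokens)
        # Text-guided visual attention (patch queries attend to words) and
        # visual-guided text attention (word queries attend to patches).
        v_ctx, _ = self.text_guided_visual(v, t, t)
        t_ctx, _ = self.visual_guided_text(t, v, v)
        fused = self.fusion(torch.cat([v_ctx, t_ctx], dim=1))
        # Mean-pool the fused sequence and predict the grounding box.
        return self.box_head(fused.mean(dim=1))

# Random features stand in for backbone outputs: 196 patches, 20 words, d=256.
model = CrossModalContextualAttention()
box = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
print(box.shape)  # torch.Size([2, 4])

The point mirrored from the abstract is the ordering: each uni-modal stream first builds intra-modality context with self-attention and then queries the other stream for inter-modality context before joint fusion. The hierarchical-semantics component of HCCAN is not modeled in this sketch.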
Pages: 2073–2083
Page count: 10
Related papers
50 items in total
  • [1] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [2] Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers
    Zhang, Qianjun
    Yuan, Jin
    APPLIED SCIENCES-BASEL, 2023, 13 (09)
  • [3] CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos
    Wang, Wen
    Zhong, Ling
    Gao, Guang
    Wan, Minhong
    Gu, Jason
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1499 - 1504
  • [4] Cross-modal orienting of visual attention
    Hillyard, Steven A.
    Stoermer, Viola S.
    Feng, Wenfeng
    Martinez, Antigona
    McDonald, John J.
    NEUROPSYCHOLOGIA, 2016, 83 : 170 - 178
  • [5] Learning Cross-Modal Context Graph for Visual Grounding
    Liu, Yongfei
    Wan, Bo
    Zhu, Xiaodan
    He, Xuming
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-20), 2020, 34 : 11645 - 11652
  • [6] Cross-modal contextual memory guides selective attention in visual-search tasks
    Chen, Siyi
    Shi, Zhuanghua
    Zinchenko, Artyom
    Mueller, Hermann J.
    Geyer, Thomas
    PSYCHOPHYSIOLOGY, 2022, 59 (07)
  • [7] Cross-modal exogenous visual selective attention
    Zhao, C
    Yang, H
    Zhang, K
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2000, 35 (3-4) : 100 - 100
  • [8] Utilizing visual attention for cross-modal coreference interpretation
    Byron, D
    Mampilly, T
    Sharma, V
    Xu, TF
    MODELING AND USING CONTEXT, PROCEEDINGS, 2005, 3554 : 83 - 96
  • [9] Bridging Modality Gap for Visual Grounding with Effective Cross-Modal Distillation
    Wang, Jiaxi
    Hu, Wenhui
    Liu, Xueyang
    Wu, Beihu
    Qiu, Yuting
    Cai, YingYing
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 347 - 363
  • [10] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 31516 - 31524