Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching

Cited by: 74
Authors
Huang, Feiran [1 ]
Zhang, Xiaoming [2 ]
Zhao, Zhonghua [3 ]
Li, Zhoujun [4 ]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Coordinat Ctr China, Natl Comp Emergency Tech Team, Beijing 100029, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Image-text matching; attention networks; deep learning; spatial-semantic;
DOI
10.1109/TIP.2018.2882225
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image-text matching with deep models has recently achieved remarkable results in many tasks, such as image captioning and image search. A major challenge in matching images and text is that the underlying relations between them are usually complicated, and modeling these relations naively may lead to suboptimal performance. In this paper, we develop a novel bi-directional spatial-semantic attention network, which leverages both the word-to-regions (W2R) relation and the visual-object-to-words (O2W) relation in a holistic deep framework for more effective matching. Specifically, to encode the W2R relation, we adopt an LSTM with a bilinear attention function to infer the image regions most related to each particular word, referred to as the W2R attention network. In the other direction, the O2W attention network is proposed to discover the semantically closest words for each visual object in the image, i.e., the visual O2W relation. A deep model unifying the two directional attention networks into a holistic learning framework is then proposed to learn the matching scores of image-text pairs. Compared with existing image-text matching methods, our approach achieves state-of-the-art performance on the Flickr30K and MSCOCO datasets.
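As a rough illustration of the W2R step described above, the sketch below implements a generic bilinear attention scoring between word features and image-region features, where each word attends over regions and receives an attention-weighted region vector. The dimensions, weight initialization, and function name are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def bilinear_attention(words, regions, M):
    """Bilinear attention: each word attends over image regions.

    words:   (T, d_w) word features (e.g. LSTM hidden states)
    regions: (R, d_r) region features (e.g. CNN feature-map cells)
    M:       (d_w, d_r) learned bilinear weight matrix
    Returns (T, d_r): per-word attention-weighted region vectors.
    """
    scores = words @ M @ regions.T               # (T, R) compatibility scores
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for stability
    alphas = np.exp(scores)
    alphas /= alphas.sum(axis=1, keepdims=True)  # softmax over regions per word
    return alphas @ regions                      # weighted sum of region features

# Toy usage with hypothetical sizes: 5 words, 49 regions (a 7x7 feature map).
rng = np.random.default_rng(0)
T, R, d_w, d_r = 5, 49, 300, 512
attended = bilinear_attention(rng.standard_normal((T, d_w)),
                              rng.standard_normal((R, d_r)),
                              rng.standard_normal((d_w, d_r)) * 0.01)
print(attended.shape)  # (5, 512)
```

In a full model, `M` would be trained jointly with the LSTM so that high scores align words with their most relevant regions; the attended vectors then feed the matching-score computation.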
Pages: 2008-2020
Page count: 13
Related papers
50 records total
  • [31] Cross-modal Semantic Interference Suppression for image-text matching
    Yao, Tao
    Peng, Shouyong
    Sun, Yujuan
    Sheng, Guorui
    Fu, Haiyan
    Kong, Xiangwei
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
  • [33] Local Alignment with Global Semantic Consistence Network for Image-Text Matching
    Li, Pengwei
    Wu, Shihua
    Lian, Zhichao
    2022 IEEE INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, INTL CONF ON CLOUD AND BIG DATA COMPUTING, INTL CONF ON CYBER SCIENCE AND TECHNOLOGY CONGRESS (DASC/PICOM/CBDCOM/CYBERSCITECH), 2022, : 652 - 657
  • [34] Regularizing Visual Semantic Embedding With Contrastive Learning for Image-Text Matching
    Liu, Yang
    Liu, Hong
    Wang, Huaqiu
    Liu, Mengyuan
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1332 - 1336
  • [35] Uniting Image and Text Deep Networks via Bi-directional Triplet Loss for Retrieval
    Hua, Yan
    Du, Jianhe
    PROCEEDINGS OF 2019 IEEE 9TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2019), 2019, : 297 - 300
  • [36] Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching
    Zhang, Kun
    Mao, Zhendong
    Liu, An-An
    Zhang, Yongdong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1320 - 1332
  • [37] Global-Guided Asymmetric Attention Network for Image-Text Matching
    Wu, Dongqing
    Li, Huihui
    Tang, Yinge
    Guo, Lei
    Liu, Hang
    NEUROCOMPUTING, 2022, 481 : 77 - 90
  • [38] Learning Fragment Self-Attention Embeddings for Image-Text Matching
    Wu, Yiling
    Wang, Shuhui
    Song, Guoli
    Huang, Qingming
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2088 - 2096
  • [39] Self-attention guided representation learning for image-text matching
    Qi, Xuefei
    Zhang, Ying
    Qi, Jinqing
    Lu, Huchuan
    NEUROCOMPUTING, 2021, 450 : 143 - 155