Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching

Cited by: 74
Authors
Huang, Feiran [1 ]
Zhang, Xiaoming [2 ]
Zhao, Zhonghua [3 ]
Li, Zhoujun [4 ]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Coordinat Ctr China, Natl Comp Emergency Tech Team, Beijing 100029, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Image-text matching; attention networks; deep learning; spatial-semantic;
DOI
10.1109/TIP.2018.2882225
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image-text matching by deep models has recently achieved remarkable results in many tasks, such as image captioning and image search. A major challenge in matching images and text is that the underlying relations between them are usually complicated, and modeling those relations too simply may lead to suboptimal performance. In this paper, we develop a novel approach, the bi-directional spatial-semantic attention network, which leverages both the word-to-regions (W2R) relation and the visual object-to-words (O2W) relation in a holistic deep framework for more effective matching. Specifically, to effectively encode the W2R relation, we adopt an LSTM with a bilinear attention function to infer the image regions that are more related to particular words; this is referred to as the W2R attention network. On the other side, the O2W attention network is proposed to discover the semantically close words for each visual object in the image, i.e., the visual O2W relation. Then, a deep model unifying the two directional attention networks in a holistic learning framework is proposed to learn the matching scores of image and text pairs. Compared with existing image-text matching methods, our approach achieves state-of-the-art performance on the Flickr30K and MSCOCO datasets.
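The abstract mentions a bilinear attention function for relating a word to image regions but gives no formula. The following is a minimal sketch of generic bilinear attention, not the authors' implementation: the function name `bilinear_attention`, the weight matrix `W`, and the toy vectors are all assumptions made for illustration, and the LSTM that the paper pairs with the attention is omitted.

```python
import math

def bilinear_attention(word, regions, W):
    """Attend over image regions given one word vector (illustrative only).

    word:    list of d_w floats (hypothetical word embedding)
    regions: list of region vectors, each with d_v floats
    W:       d_w x d_v bilinear weight matrix (learned in a real model)
    """
    d_v = len(W[0])
    # Project the word into the visual space: q = word^T W.
    q = [sum(word[i] * W[i][j] for i in range(len(word))) for j in range(d_v)]
    # Bilinear score per region: s_k = word^T W region_k.
    scores = [sum(qj * rj for qj, rj in zip(q, r)) for r in regions]
    # Softmax over regions, shifted by the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attended visual context: attention-weighted sum of region vectors.
    context = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(d_v)]
    return weights, context

# Toy inputs (assumed values): one 2-d word vector and three 2-d regions.
weights, context = bilinear_attention(
    word=[1.0, 0.0],
    regions=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    W=[[2.0, 0.0], [0.0, 2.0]],
)
```

In this toy setup the first and third regions receive equal, higher weights than the second, because only they overlap with the word vector after projection through `W`.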
Pages: 2008-2020
Page count: 13
Related papers
50 in total
  • [1] Bi-directional attention comparison for semantic sentence matching
    Huiyuan Lai
    Yizheng Tao
    Chunliu Wang
    Lunfan Xu
    Dingyong Tang
    Gongliang Li
    Multimedia Tools and Applications, 2020, 79 : 14609 - 14624
  • [2] Bi-directional attention comparison for semantic sentence matching
    Lai, Huiyuan
    Tao, Yizheng
    Wang, Chunliu
    Xu, Lunfan
    Tang, Dingyong
    Li, Gongliang
    MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79 (21-22) : 14609 - 14624
  • [3] Dual Semantic Relationship Attention Network for Image-Text Matching
    Wen, Keyu
    Gu, Xiaodong
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [4] PFAN++: Bi-Directional Image-Text Retrieval With Position Focused Attention Network
    Wang, Yaxiong
    Yang, Hao
    Bai, Xiuxiu
    Qian, Xueming
    Ma, Lin
    Lu, Jing
    Li, Biao
    Fan, Xin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 (23) : 3362 - 3376
  • [5] Semantic-Enhanced Attention Network for Image-Text Matching
    Zhou, Huanxiao
    Geng, Yushui
    Zhao, Jing
    Ma, Xishan
    PROCEEDINGS OF THE 2024 27TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, CSCWD 2024, 2024, : 1256 - 1261
  • [6] Cross-Modal Attention With Semantic Consistence for Image-Text Matching
    Xu, Xing
    Wang, Tan
    Yang, Yang
    Zuo, Lin
    Shen, Fumin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (12) : 5412 - 5425
  • [7] Learning Dual Semantic Relations With Graph Attention for Image-Text Matching
    Wen, Keyu
    Gu, Xiaodong
    Cheng, Qingrong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (07) : 2866 - 2879
  • [8] Bi-Attention enhanced representation learning for image-text matching
    Tian, Yumin
    Ding, Aqiang
    Wang, Di
    Luo, Xuemei
    Wan, Bo
    Wang, Yifeng
    PATTERN RECOGNITION, 2023, 140
  • [9] SPATIAL-SEMANTIC ATTENTION FOR GROUNDED IMAGE CAPTIONING
    Hu, Wenzhe
    Wang, Lanxiao
    Xu, Linfeng
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 61 - 65
  • [10] Bi-directional Image-Text Matching Deep Learning-Based Approaches: Concepts, Methodologies, Benchmarks and Challenges
    Ebaid, Doaa B.
    Madbouly, Magda M.
    El-Zoghabi, Adel A.
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 2023, 16 (01)