Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching

Cited by: 74
Authors
Huang, Feiran [1]
Zhang, Xiaoming [2]
Zhao, Zhonghua [3]
Li, Zhoujun [4]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Coordinat Ctr China, Natl Comp Emergency Tech Team, Beijing 100029, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Image-text matching; attention networks; deep learning; spatial-semantic;
DOI
10.1109/TIP.2018.2882225
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image-text matching with deep models has recently achieved remarkable results in many tasks, such as image captioning and image search. A major challenge in matching images and text is that the relations between them are usually complicated, and modeling these relations too simply may lead to suboptimal performance. In this paper, we develop a novel approach, the bi-directional spatial-semantic attention network, which leverages both the word-to-regions (W2R) relation and the visual object-to-words (O2W) relation in a holistic deep framework for more effective matching. Specifically, to effectively encode the W2R relation, we adopt an LSTM with a bilinear attention function to infer the image regions that are more related to particular words; this is referred to as the W2R attention network. On the other side, the O2W attention network is proposed to discover the semantically close words for each visual object in the image, i.e., the visual O2W relation. Then, a deep model unifying the two directional attention networks in a holistic learning framework is proposed to learn the matching scores of image and text pairs. Compared with existing image-text matching methods, our approach achieves state-of-the-art performance on the Flickr30K and MSCOCO datasets.
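The bilinear attention step named in the abstract for the W2R direction can be sketched roughly as follows. The abstract does not give the exact formulation, so the score form s_k = h^T W v_k, the softmax normalization, and all names here (bilinear_attention, h, regions, W) are assumptions for illustration, not the authors' implementation. The LSTM that would produce the word state h is omitted.

```python
import math

def bilinear_attention(h, regions, W):
    """Attend over image regions for one word state (illustrative sketch).

    h:       word hidden state, list of length d_h (assumed to come from an LSTM)
    regions: list of region feature vectors, each of length d_v
    W:       bilinear weight matrix, d_h rows by d_v columns (hypothetical parameter)
    """
    d_v = len(regions[0])
    # Project the word state through W: q = h^T W, a vector of length d_v.
    q = [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(d_v)]
    # Bilinear score for each region: s_k = h^T W v_k.
    scores = [sum(q[j] * v[j] for j in range(d_v)) for v in regions]
    # Numerically stable softmax turns scores into attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attended visual context: weighted sum of the region features.
    context = [sum(w * v[j] for w, v in zip(weights, regions)) for j in range(d_v)]
    return weights, context

# Toy example: one word state and three 2-dimensional region features.
h = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W = [[2.0, 0.0], [0.0, 2.0]]
weights, context = bilinear_attention(h, regions, W)
```

In this toy setup the region aligned with the word state receives the largest weight and the opposing region the smallest.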
Pages: 2008-2020
Number of pages: 13
Related papers
50 in total
  • [41] Unifying Multimodal Transformer for Bi-directional Image and Text Generation
    Huang, Yupan
    Xue, Hongwei
    Liu, Bei
    Lu, Yutong
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1138 - 1147
  • [42] Bi-Directional Co-Attention Network for Image Captioning
    Jiang, Weitao
    Wang, Weixuan
    Hu, Haifeng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (04)
  • [43] Bi-Directional Seed Attention Network for Interactive Image Segmentation
    Song, Gwangmo
    Lee, Kyoung Mu
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 1540 - 1544
  • [44] EXPLORING ENTITY-LEVEL SPATIAL RELATIONSHIPS FOR IMAGE-TEXT MATCHING
    Xia, Yaxian
    Huang, Lun
    Wang, Wenmin
    Wei, Xiao-Yong
    Chen, Jie
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4452 - 4456
  • [45] Multi-level Symmetric Semantic Alignment Network for image-text matching
    Wang, Wenzhuang
    Di, Xiaoguang
    Liu, Maozhen
    Gao, Feng
    NEUROCOMPUTING, 2024, 599
  • [46] Image-text matching algorithm based on multi-level semantic alignment
    Li Y.
    Yao T.
    Zhang L.
    Sun Y.
    Fu H.
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): : 551 - 558
  • [47] Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching
    Zhang, Kun
    Zhang, Lei
    Hu, Bo
    Zhu, Mengxiao
    Mao, Zhendong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4828 - 4837
  • [48] An end-to-end image-text matching approach considering semantic uncertainty
    Tuerhong, Gulanbaier
    Dai, Xin
    Tian, Liwei
    Wushouer, Mairidan
    NEUROCOMPUTING, 2024, 607
  • [49] Image enhancement with bi-directional normalization and color attention-guided generative adversarial networks
    Shan Liu
    Shihao Shan
    Guoqiang Xiao
    Xinbo Gao
    Song Wu
    International Journal of Multimedia Information Retrieval, 2024, 13
  • [50] Image enhancement with bi-directional normalization and color attention-guided generative adversarial networks
    Liu, Shan
    Shan, Shihao
    Xiao, Guoqiang
    Gao, Xinbo
    Wu, Song
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (01)