Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching

Cited by: 74
Authors
Huang, Feiran [1]
Zhang, Xiaoming [2]
Zhao, Zhonghua [3]
Li, Zhoujun [4]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Coordinat Ctr China, Natl Comp Emergency Tech Team, Beijing 100029, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
Keywords
Image-text matching; attention networks; deep learning; spatial-semantic;
DOI
10.1109/TIP.2018.2882225
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image-text matching with deep models has recently achieved remarkable results in many tasks, such as image captioning and image search. A major challenge in matching images and text is that the underlying relations between them are usually complicated, and modeling those relations too simply may lead to suboptimal performance. In this paper, we develop a novel approach, the bi-directional spatial-semantic attention network, which leverages both the word-to-regions (W2R) relation and the visual object-to-words (O2W) relation in a holistic deep framework for more effective matching. Specifically, to effectively encode the W2R relation, we adopt an LSTM with a bilinear attention function to infer the image regions that are more closely related to particular words; we refer to this as the W2R attention network. Conversely, the O2W attention network is proposed to discover the semantically close words for each visual object in the image, i.e., the visual O2W relation. A deep model that unifies the two directional attention networks in a holistic learning framework is then proposed to learn the matching scores of image-text pairs. Compared with existing image-text matching methods, our approach achieves state-of-the-art performance on the Flickr30K and MSCOCO datasets.
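The abstract mentions a bilinear attention function that relates a word to image regions. The following is only a minimal sketch of generic bilinear attention, not the authors' implementation: the function name `bilinear_attention`, the matrix `W`, the toy vectors, and the use of a softmax over regions are all assumptions made for illustration. Each region is scored as `word_state^T W region`, the scores are normalized, and the regions are averaged with those weights.

```python
import math


def bilinear_attention(word_state, regions, W):
    """Attend over image regions for one word (illustrative sketch only).

    word_state: list of d_w floats (e.g. a hypothetical LSTM hidden state)
    regions:    list of n region feature vectors, each with d_v floats
    W:          d_w x d_v bilinear weight matrix (assumed, normally learned)
    """
    d_w, d_v = len(word_state), len(W[0])
    # Project the word state into region space: p = W^T word_state
    projected = [sum(word_state[a] * W[a][b] for a in range(d_w)) for b in range(d_v)]
    # Bilinear score for each region: word_state^T W region
    scores = [sum(p * r for p, r in zip(projected, region)) for region in regions]
    # Numerically stable softmax over regions
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted average of the region features
    context = [sum(w * region[j] for w, region in zip(weights, regions)) for j in range(d_v)]
    return weights, context


# Toy inputs (made up): one word state and three region vectors.
word_state = [1.0, 0.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
regions = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
weights, context = bilinear_attention(word_state, regions, W)
```

With these toy inputs the first region receives the largest weight, because it is the only region aligned with the projected word state. In a trained model `W` would be learned jointly with the rest of the network.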
Pages: 2008-2020
Page count: 13