Multi-Modality Cross Attention Network for Image and Sentence Matching

Cited by: 239
|
Authors
Wei, Xi [1 ]
Zhang, Tianzhu [1 ]
Li, Yan [2 ]
Zhang, Yongdong [1 ]
Wu, Feng [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Kuaishou Technol, Beijing, Peoples R China
DOI
10.1109/CVPR42600.2020.01095
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The key to image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods exploit only the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task. In contrast, in this work we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching that jointly models the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism that exploits not only the intra-modality relationship within each modality but also the inter-modality relationship between image regions and sentence words, allowing the two to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
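To make the described mechanism concrete, below is a minimal PyTorch sketch of how intra-modality self-attention and inter-modality cross-attention between image-region and sentence-word features can be combined into a single visual-semantic similarity score. This is an illustrative approximation, not the authors' MMCA implementation: the module name CrossModalityAttention, the feature dimensions, the mean pooling, and the cosine similarity are assumptions made for the example.

# Minimal sketch (not the authors' code) of combining intra-modality
# self-attention with inter-modality cross-attention between image regions
# and sentence words; dimensions, pooling, and the cosine similarity are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Intra-modality relationships: regions attend to regions, words to words.
        self.self_attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality relationships: each modality queries the other.
        self.cross_attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, R, dim) image-region features; words: (B, W, dim) word features.
        intra_img, _ = self.self_attn_img(regions, regions, regions)
        intra_txt, _ = self.self_attn_txt(words, words, words)
        inter_img, _ = self.cross_attn_img(regions, words, words)    # regions attend to words
        inter_txt, _ = self.cross_attn_txt(words, regions, regions)  # words attend to regions
        # Fuse both relationship types and pool to one embedding per modality.
        img_vec = F.normalize((intra_img + inter_img).mean(dim=1), dim=-1)
        txt_vec = F.normalize((intra_txt + inter_txt).mean(dim=1), dim=-1)
        return img_vec, txt_vec

if __name__ == "__main__":
    model = CrossModalityAttention()
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected-region features per image
    words = torch.randn(2, 20, 512)    # e.g. 20 word features per sentence
    img_vec, txt_vec = model(regions, words)
    similarity = (img_vec * txt_vec).sum(dim=-1)  # cosine visual-semantic similarity
    print(similarity.shape)  # torch.Size([2])

In practice, the region and word features would come from pre-trained visual and textual encoders rather than the random tensors used here, which serve only to illustrate tensor shapes.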
Pages: 10938 - 10947
Number of pages: 10
Related Papers
50 records in total
  • [41] An Interpretable Fusion Siamese Network for Multi-Modality Remote Sensing Ship Image Retrieval
    Xiong, Wei
    Xiong, Zhenyu
    Cui, Yaqi
    Huang, Linzhou
    Yang, Ruining
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (06) : 2696 - 2712
  • [42] Multi-Modality Medical Image Fusion Using Convolutional Neural Network and Contrast Pyramid
    Wang, Kunpeng
    Zheng, Mingyao
    Wei, Hongyan
    Qi, Guanqiu
    Li, Yuanyuan
    SENSORS, 2020, 20 (08)
  • [43] AMNet: a new RGB-D instance segmentation network based on attention and multi-modality
    Wang, Mingyang
    Hu, Lihua
    Bai, Yuting
    Yao, Xiaoling
    Hu, Jianhua
    Zhang, Sulan
    VISUAL COMPUTER, 2024, 40 (02) : 1311 - 1325
  • [44] Multi-Relation Attention Network for Image Patch Matching
    Quan, Dou
    Wang, Shuang
    Li, Yi
    Yang, Bowu
    Huyan, Ning
    Chanussot, Jocelyn
    Hou, Biao
    Jiao, Licheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7127 - 7142
  • [46] Multi-modality image registration by maximization of mutual information
    Maes, F
    Collignon, A
    Vandermeulen, D
    Marchal, G
    Suetens, P
    PROCEEDINGS OF THE IEEE WORKSHOP ON MATHEMATICAL METHODS IN BIOMEDICAL IMAGE ANALYSIS, 1996, : 14 - 22
  • [47] A normalised entropy measure for multi-modality image alignment
    Studholme, C
    Hawkes, DJ
    Hill, DLG
    MEDICAL IMAGING 1998: IMAGE PROCESSING, PTS 1 AND 2, 1998, 3338 : 132 - 143
  • [48] MULTI-MODALITY IMAGE REGISTRATION FOR SUBDURAL ELECTRODE LOCALIZATION
    Dong, Shuo
    Liu, Yuan
    Cai, Lixin
    Bai, Mei
    Yan, Hanmin
    BIOMEDICAL ENGINEERING-APPLICATIONS BASIS COMMUNICATIONS, 2014, 26 (05):
  • [49] Multi-modality Image Registration using the Decomposition Model
    Ibrahim, Mazlinda
    Chen, Ke
    4TH INTERNATIONAL CONFERENCE ON MATHEMATICAL SCIENCES (ICMS4): MATHEMATICAL SCIENCES: CHAMPIONING THE WAY IN A PROBLEM BASED AND DATA DRIVEN SOCIETY, 2017, 1830
  • [50] Triple-attention interaction network for breast tumor classification based on multi-modality images
    Yang, Xiao
    Xi, Xiaoming
    Wang, Kesong
    Sun, Liangyun
    Meng, Lingzhao
    Nie, Xiushan
    Qiao, Lishan
    Yin, Yilong
    PATTERN RECOGNITION, 2023, 139