Multi-Modality Cross Attention Network for Image and Sentence Matching

Cited by: 239
Authors
Wei, Xi [1]
Zhang, Tianzhu [1]
Li, Yan [2]
Zhang, Yongdong [1]
Wu, Feng [1]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Kuaishou Technol, Beijing, Peoples R China
DOI
10.1109/CVPR42600.2020.01095
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The key to image and sentence matching is accurately measuring the visual-semantic similarity between an image and a sentence. However, most existing methods use only one of two signals for cross-modal matching: the intra-modality relationship within each modality, or the inter-modality relationship between image regions and sentence words. In contrast, we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching that jointly models the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism that exploits both the intra-modality relationship within each modality and the inter-modality relationship between image regions and sentence words, so that the two complement and enhance each other. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
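To make the idea of jointly modeling intra-modality and inter-modality relationships concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that region and word features already share one embedding dimension, stacks them into a single sequence, and applies plain scaled dot-product attention, so that each attention row covers both same-modality and cross-modality pairs. The function names (`joint_attention`, `cosine`), the absence of learned projections, the mean pooling, and the random inputs are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(regions, words):
    """Scaled dot-product attention over stacked region and word features.

    regions: (R, d) array of image-region features (assumed precomputed).
    words:   (W, d) array of word features in the same d-dim space (assumed).
    Returns the attended regions, attended words, and the (R+W, R+W) weights.
    """
    tokens = np.concatenate([regions, words], axis=0)  # (R + W, d)
    d = tokens.shape[1]
    # Each row scores one token against every region AND every word, so
    # intra-modality and inter-modality pairs share a single softmax.
    scores = tokens @ tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    attended = weights @ tokens
    n_regions = regions.shape[0]
    return attended[:n_regions], attended[n_regions:], weights


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical inputs: 4 regions and 6 words, each with 8-dim features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
words = rng.normal(size=(6, 8))

img_tokens, txt_tokens, weights = joint_attention(regions, words)
# Mean-pool each modality and compare; the pooling choice is an assumption.
score = cosine(img_tokens.mean(axis=0), txt_tokens.mean(axis=0))
```

In this sketch the top-left and bottom-right blocks of `weights` hold the intra-modality attention, and the off-diagonal blocks hold the inter-modality attention. A trained model would add learned query, key, and value projections and a ranking loss, which the abstract does not specify.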
Pages: 10938-10947
Number of pages: 10