Multi-Modality Cross Attention Network for Image and Sentence Matching

Cited by: 239

Authors:

Wei, Xi [1]

Zhang, Tianzhu [1]

Li, Yan [2]

Zhang, Yongdong [1]

Wu, Feng [1]

Affiliations:

[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China

[2] Kuaishou Technol, Beijing, Peoples R China

Source:

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020

Keywords:

DOI:

10.1109/CVPR42600.2020.01095

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The key to image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods make use of only the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task. Different from them, in this work, we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching by jointly modeling the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks including Flickr30K and MS-COCO demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
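
Illustrative note (not part of the original record): the abstract does not say how the cross-attention mechanism is implemented. The sketch below shows one plausible reading, assuming that region and word features share a dimension, are concatenated into a single token sequence, and are passed through plain scaled dot-product attention with no learned projections. Under that assumption each attention row covers both same-modality and other-modality tokens. The function names and toy feature values are hypothetical and are not taken from the paper.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: each query becomes a weighted
    # average of the value rows, weighted by query-key similarity.
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

def joint_cross_attention(regions, words):
    # Assumption: stacking both modalities into one sequence lets a single
    # attention pass mix intra-modality and inter-modality information.
    tokens = regions + words
    attended, weights = attention(tokens, tokens, tokens)
    n = len(regions)
    return attended[:n], attended[n:], weights

# Toy features: 3 image regions and 2 sentence words, dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, -1.0]]
new_regions, new_words, weights = joint_cross_attention(regions, words)
```

In each row of `weights`, the first three entries are weights on image regions and the last two are weights on words, so a region's updated feature draws on other regions and on the sentence at the same time. A real model would add learned query, key, and value projections and multiple heads.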

Citation

Pages: 10938-10947

Page count: 10
