Multi-Modality Cross Attention Network for Image and Sentence Matching

Cited by: 239
|
Authors
Wei, Xi [1 ]
Zhang, Tianzhu [1 ]
Li, Yan [2 ]
Zhang, Yongdong [1 ]
Wu, Feng [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Kuaishou Technol, Beijing, Peoples R China
DOI
10.1109/CVPR42600.2020.01095
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The key to image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods exploit only the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task. In contrast, in this work we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching that jointly models the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism that exploits not only the intra-modality relationship within each modality but also the inter-modality relationship between image regions and sentence words, allowing the two to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
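To make the described mechanism concrete, below is a minimal PyTorch sketch of how intra-modality self-attention and inter-modality cross-attention between image-region and sentence-word features can be combined into a single visual-semantic similarity score. This is an illustrative approximation, not the authors' MMCA implementation: the module name CrossModalityAttention, the feature dimensions, the mean pooling, and the cosine similarity are assumptions made for the example.

# Minimal sketch (not the authors' code) of combining intra-modality
# self-attention with inter-modality cross-attention between image regions
# and sentence words; dimensions, pooling, and the cosine similarity are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Intra-modality relationships: regions attend to regions, words to words.
        self.self_attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality relationships: each modality queries the other.
        self.cross_attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, R, dim) image-region features; words: (B, W, dim) word features.
        intra_img, _ = self.self_attn_img(regions, regions, regions)
        intra_txt, _ = self.self_attn_txt(words, words, words)
        inter_img, _ = self.cross_attn_img(regions, words, words)    # regions attend to words
        inter_txt, _ = self.cross_attn_txt(words, regions, regions)  # words attend to regions
        # Fuse both relationship types and pool to one embedding per modality.
        img_vec = F.normalize((intra_img + inter_img).mean(dim=1), dim=-1)
        txt_vec = F.normalize((intra_txt + inter_txt).mean(dim=1), dim=-1)
        return img_vec, txt_vec

if __name__ == "__main__":
    model = CrossModalityAttention()
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected-region features per image
    words = torch.randn(2, 20, 512)    # e.g. 20 word features per sentence
    img_vec, txt_vec = model(regions, words)
    similarity = (img_vec * txt_vec).sum(dim=-1)  # cosine visual-semantic similarity
    print(similarity.shape)  # torch.Size([2])

In practice, the region and word features would come from pre-trained visual and textual encoders rather than the random tensors used here, which serve only to illustrate tensor shapes.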
Pages: 10938 - 10947
Number of pages: 10
Related Papers
50 records in total
  • [41] An Interpretable Fusion Siamese Network for Multi-Modality Remote Sensing Ship Image Retrieval
    Xiong, Wei
    Xiong, Zhenyu
    Cui, Yaqi
    Huang, Linzhou
    Yang, Ruining
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (06) : 2696 - 2712
  • [42] Multi-Modality Medical Image Fusion Using Convolutional Neural Network and Contrast Pyramid
    Wang, Kunpeng
    Zheng, Mingyao
    Wei, Hongyan
    Qi, Guanqiu
    Li, Yuanyuan
    SENSORS, 2020, 20 (08)
  • [43] AMNet: a new RGB-D instance segmentation network based on attention and multi-modality
    Wang, Mingyang
    Hu, Lihua
    Bai, Yuting
    Yao, Xiaoling
    Hu, Jianhua
    Zhang, Sulan
    VISUAL COMPUTER, 2024, 40 (02) : 1311 - 1325
  • [44] Multi-Relation Attention Network for Image Patch Matching
    Quan, Dou
    Wang, Shuang
    Li, Yi
    Yang, Bowu
    Huyan, Ning
    Chanussot, Jocelyn
    Hou, Biao
    Jiao, Licheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7127 - 7142
  • [46] Multi-modality image registration by maximization of mutual information
    Maes, F
    Collignon, A
    Vandermeulen, D
    Marchal, G
    Suetens, P
    PROCEEDINGS OF THE IEEE WORKSHOP ON MATHEMATICAL METHODS IN BIOMEDICAL IMAGE ANALYSIS, 1996, : 14 - 22
  • [47] A normalised entropy measure for multi-modality image alignment
    Studholme, C
    Hawkes, DJ
    Hill, DLG
    MEDICAL IMAGING 1998: IMAGE PROCESSING, PTS 1 AND 2, 1998, 3338 : 132 - 143
  • [48] MULTI-MODALITY IMAGE REGISTRATION FOR SUBDURAL ELECTRODE LOCALIZATION
    Dong, Shuo
    Liu, Yuan
    Cai, Lixin
    Bai, Mei
    Yan, Hanmin
    BIOMEDICAL ENGINEERING-APPLICATIONS BASIS COMMUNICATIONS, 2014, 26 (05):
  • [49] Multi-modality Image Registration using the Decomposition Model
    Ibrahim, Mazlinda
    Chen, Ke
    4TH INTERNATIONAL CONFERENCE ON MATHEMATICAL SCIENCES (ICMS4): MATHEMATICAL SCIENCES: CHAMPIONING THE WAY IN A PROBLEM BASED AND DATA DRIVEN SOCIETY, 2017, 1830
  • [50] Triple-attention interaction network for breast tumor classification based on multi-modality images
    Yang, Xiao
    Xi, Xiaoming
    Wang, Kesong
    Sun, Liangyun
    Meng, Lingzhao
    Nie, Xiushan
    Qiao, Lishan
    Yin, Yilong
    PATTERN RECOGNITION, 2023, 139