Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Cited by: 44
Authors
Zhang, Kun [1]
Mao, Zhendong [1]
Liu, An-An [3]
Zhang, Yongdong [1,2]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Anhui, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Semantics; Optimization; Visualization; Training; Task analysis; Representation learning; Correlation; Image-text matching; attention network; unified adaptive relevance distinguishable learning;
DOI
10.1109/TMM.2022.3141603
Chinese Library Classification
TP [Automation Technology; Computer Technology];
Discipline Code
0812;
Abstract
Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. The core is to accurately learn semantic alignment, i.e., to find the relevant shared semantics in image and text. Existing methods typically attend to all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces negative similarities to zero and keeps positive ones. However, this fixed threshold is completely isolated from feature learning, so it cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework, maximally distinguishing the relevant and irrelevant distributions to obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, driving the model to learn more discriminative features. The explicit relevance threshold is integrated directly into similarity matching, which kills two birds with one stone: (1) it excludes the disturbance of irrelevant fragment contents so that precisely the relevant shared semantics are aggregated, boosting matching accuracy; and (2) it avoids computing queries for irrelevant fragments, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms the state of the art, with relative rSum improvements of 2%-4% (16.9%-35.3% over the SCAN baseline), while reducing retrieval time by 50%-73%.
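As a rough illustration of the idea in the abstract (not the authors' implementation), the following NumPy sketch contrasts the fixed zero-threshold baseline with a relevance-thresholded aggregation. Here `tau` is a hypothetical stand-in for the learned relevance boundary, which in UARDA is optimized jointly with the features rather than fixed by hand:

```python
import numpy as np

def relevance_attention(word_feats, region_feats, tau=0.2):
    """Aggregate image regions per word, keeping only regions whose cosine
    similarity exceeds the relevance threshold tau (names are illustrative)."""
    # L2-normalize, then compute word-region cosine similarities.
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sim = w @ r.T  # shape: (n_words, n_regions)
    # Fixed-threshold baseline (e.g. SCAN) would keep every positive entry:
    #   np.maximum(sim, 0.0)
    # Thresholded variant: keep only similarities above the boundary tau,
    # so weakly similar (irrelevant) regions contribute nothing.
    kept = np.where(sim > tau, sim - tau, 0.0)
    # Normalize the surviving weights and aggregate the relevant regions.
    weights = kept / (kept.sum(axis=1, keepdims=True) + 1e-8)
    attended = weights @ region_feats  # shared semantics per word
    return attended, kept
```

Regions below the boundary receive exactly zero weight, so they can also be skipped entirely at retrieval time, which is the source of the speed-up the abstract reports.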
Pages: 1320-1332
Page count: 13
Related Papers
50 items in total
  • [41] Cross-modal independent matching network for image-text retrieval
    Ke, Xiao
    Chen, Baitao
    Yang, Xiong
    Cai, Yuhang
    Liu, Hao
    Guo, Wenzhong
    PATTERN RECOGNITION, 2025, 159
  • [42] Multi-scale motivated neural network for image-text matching
    Qin, Xueyang
    Li, Lishuang
    Pang, Guangyao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 : 4383 - 4407
  • [43] CycleMatch: A cycle-consistent embedding network for image-text matching
    Liu, Yu
    Guo, Yanming
    Liu, Li
    Bakker, Erwin M.
    Lew, Michael S.
    PATTERN RECOGNITION, 2019, 93 : 365 - 379
  • [44] Image-text interaction graph neural network for image-text sentiment analysis
    Liao, Wenxiong
    Zeng, Bi
    Liu, Jianqi
    Wei, Pengfei
    Fang, Jiongkun
    APPLIED INTELLIGENCE, 2022, 52 : 11184 - 11198
  • [45] Image-text interaction graph neural network for image-text sentiment analysis
    Liao, Wenxiong
    Zeng, Bi
    Liu, Jianqi
    Wei, Pengfei
    Fang, Jiongkun
    APPLIED INTELLIGENCE, 2022, 52 (10) : 11184 - 11198
  • [46] Global Relation-Aware Attention Network for Image-Text Retrieval
    Cao, Jie
    Qian, Shengsheng
    Zhang, Huaiwen
    Fang, Quan
    Xu, Changsheng
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 19 - 28
  • [47] Thangka Image-Text Matching Based on Adaptive Pooling Layer and Improved Transformer
    Wang, Kaijie
    Wang, Tiejun
    Guo, Xiaoran
    Xu, Kui
    Wu, Jiao
    APPLIED SCIENCES-BASEL, 2024, 14 (02):
  • [48] Similarity Reasoning and Filtration for Image-Text Matching
    Diao, Haiwen
    Zhang, Ying
    Ma, Lin
    Lu, Huchuan
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 1218 - 1226
  • [49] Asymmetric Polysemous Reasoning for Image-Text Matching
    Zhang, Hongping
    Yang, Ming
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1013 - 1022
  • [50] Visual Semantic Reasoning for Image-Text Matching
    Li, Kunpeng
    Zhang, Yulun
    Li, Kai
    Li, Yuanyuan
    Fu, Yun
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4653 - 4661