Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Cited by: 44
Authors
Zhang, Kun [1]
Mao, Zhendong [1]
Liu, An-An [3]
Zhang, Yongdong [1, 2]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Anhui, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Semantics; Optimization; Visualization; Training; Task analysis; Representation learning; Correlation; Image-text matching; attention network; unified adaptive relevance distinguishable learning
DOI
10.1109/TMM.2022.3141603
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Image-text matching, a fundamental cross-modal task, bridges the gap between vision and language. Its core is to learn semantic alignment accurately so as to find the relevant semantics shared by an image and a text. Existing methods typically treat all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that sets negative values to zero and keeps positive ones. However, this fixed threshold is isolated from feature learning, so it cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework, maximally separating the relevant and irrelevant distributions to obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, which helps the model learn more discriminative features. The explicit relevance threshold is integrated into similarity matching, which brings two benefits: (1) it excludes the disturbance of irrelevant fragment contents so that precisely relevant shared semantics are aggregated, boosting matching accuracy, and (2) it avoids computing irrelevant fragment queries, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2%-4% (16.9%-35.3% for the SCAN baseline) and retrieval time reduced by 50%-73%.
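The contrast the abstract draws between a fixed zero threshold and an explicit relevance threshold can be illustrated with a small sketch. This is not the authors' implementation: the function name `thresholded_attention`, the softmax with a `temperature` parameter, and the hand-picked threshold value 0.2 are all assumptions made for illustration. In UARDA the threshold is learned jointly with the features, whereas here it is simply passed in as a number.

```python
import math


def thresholded_attention(similarities, threshold=0.0, temperature=10.0):
    """Attention weights over regions for one word query (illustrative only).

    Regions whose similarity does not exceed `threshold` get weight 0.
    Returns None when no region passes, so the caller can skip the query.
    threshold=0.0 mimics the ReLU-style fixed cut-off described above.
    """
    # Keep only regions above the threshold, measured as a margin over it.
    kept = [(i, s - threshold) for i, s in enumerate(similarities) if s > threshold]
    if not kept:
        return None  # irrelevant query: nothing to aggregate

    # Numerically stable softmax over the kept margins (assumed normalisation).
    top = max(margin for _, margin in kept)
    exps = [(i, math.exp(temperature * (margin - top))) for i, margin in kept]
    total = sum(e for _, e in exps)

    weights = [0.0] * len(similarities)
    for i, e in exps:
        weights[i] = e / total
    return weights


# Hypothetical word-region similarities for a single word.
sims = [0.62, 0.05, -0.30, 0.10]

fixed = thresholded_attention(sims, threshold=0.0)    # ReLU-like: keeps regions 0, 1, 3
stricter = thresholded_attention(sims, threshold=0.2)  # keeps only region 0

# A word with no region above the threshold is skipped entirely.
skipped = thresholded_attention([-0.1, 0.05], threshold=0.2)
```

With the zero threshold, weakly similar regions still receive some attention mass; with the higher threshold only the clearly relevant region contributes, and a query with no relevant region returns `None`, which corresponds to the two benefits listed in the abstract (cleaner aggregation and skipped irrelevant queries).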
Pages: 1320-1332
Page count: 13
Related papers
50 in total
  • [31] Team HUGE: Image-Text Matching via Hierarchical and Unified Graph Enhancing
    Li, Bo
    Wu, You
    Li, Zhixin
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 704 - 712
  • [32] Context-Aware Attention Network for Image-Text Retrieval
    Zhang, Qi
    Lei, Zhen
    Zhang, Zhaoxiang
    Li, Stan Z.
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 3533 - 3542
  • [33] Re-ranking image-text matching by adaptive metric fusion
    Niu, Kai
    Huang, Yan
    Wang, Liang
    PATTERN RECOGNITION, 2020, 104
  • [34] A method for image-text matching based on semantic filtering and adaptive adjustment
    Jin, Ran
    Hou, Tengda
    Jin, Tao
    Yuan, Jie
    Du, Chenjie
    EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2024, 2024 (01)
  • [35] Multi-scale motivated neural network for image-text matching
    Qin, Xueyang
    Li, Lishuang
    Pang, Guangyao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (2) : 4383 - 4407
  • [36] Dual-View Semantic Inference Network for image-text matching
    Wu, Chunlei
    Wu, Jie
    Cao, Haiwen
    Wei, Yiwei
    Wang, Leiquan
    NEUROCOMPUTING, 2021, 426 : 47 - 57
  • [37] Cross-modal Semantically Augmented Network for Image-text Matching
    Yao, Tao
    Li, Yiru
    Li, Ying
    Zhu, Yingying
    Wang, Gang
    Yue, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)
  • [38] Cross-modal Graph Matching Network for Image-text Retrieval
    Cheng, Yuhao
    Zhu, Xiaoguang
    Qian, Jiuchao
    Wen, Fei
    Liu, Peilin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (04)
  • [39] Local Alignment with Global Semantic Consistence Network for Image-Text Matching
    Li, Pengwei
    Wu, Shihua
    Lian, Zhichao
    2022 IEEE INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, INTL CONF ON CLOUD AND BIG DATA COMPUTING, INTL CONF ON CYBER SCIENCE AND TECHNOLOGY CONGRESS (DASC/PICOM/CBDCOM/CYBERSCITECH), 2022, : 652 - 657
  • [40] Step-Wise Hierarchical Alignment Network for Image-Text Matching
    Ji, Zhong
    Chen, Kexin
    Wang, Haoran
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 765 - 771