Cross-Modal Retrieval With Partially Mismatched Pairs

Cited by: 34
Authors
Hu, Peng [1 ]
Huang, Zhenyu [1 ]
Peng, Dezhong [1 ,2 ,3 ]
Wang, Xu [1 ]
Peng, Xi [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[2] Chengdu Ruibei Yingte Informat Technol Co Ltd, Chengdu 610054, Peoples R China
[3] Sichuan Zhiqian Technol Co Ltd, Chengdu 610065, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation; National Key R&D Program of China;
Keywords
Complementary contrastive learning; cross-modal retrieval; mismatched pairs
DOI
10.1109/TPAMI.2023.3247939
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we study a challenging but less-touched problem in cross-modal retrieval, i.e., partially mismatched pairs (PMPs). Specifically, in real-world scenarios, huge amounts of multimedia data (e.g., the Conceptual Captions dataset) are collected from the Internet, so some irrelevant cross-modal pairs are inevitably treated as matched. Undoubtedly, such a PMP problem will remarkably degrade cross-modal retrieval performance. To tackle this problem, we derive a unified theoretical Robust Cross-modal Learning framework (RCL) with an unbiased estimator of the cross-modal retrieval risk, which aims to endow cross-modal retrieval methods with robustness against PMPs. In detail, our RCL adopts a novel complementary contrastive learning paradigm to address two challenges, namely the overfitting and underfitting issues. On the one hand, our method utilizes only the negative information, which is far less likely to be false than the positive information, thus avoiding overfitting to PMPs. However, such robust strategies could induce underfitting, making model training more difficult. On the other hand, to address the underfitting issue brought by weak supervision, we propose leveraging all available negative pairs to enhance the supervision contained in the negative information. Moreover, to further improve performance, we propose minimizing the upper bounds of the risk to pay more attention to hard samples. To verify the effectiveness and robustness of the proposed method, we carry out comprehensive experiments on five widely used benchmark datasets, comparing against nine state-of-the-art approaches on image-text and video-text retrieval tasks. The code is available at https://github.com/penghu-cs/RCL.
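The core idea described in the abstract — supervising only with negative pairs, which are far less likely to be mislabeled than positives — can be sketched as a toy complementary contrastive loss. This is a minimal, hypothetical simplification for illustration, not the paper's actual RCL implementation: instead of maximizing the probability of the (possibly mismatched) positive pair, it minimizes the probability assigned to each known-negative pair.

```python
import numpy as np

def complementary_contrastive_loss(img, txt, tau=0.1):
    """Toy negative-only (complementary) contrastive loss.

    img, txt: (n, d) arrays of embeddings, row i of each forms a labeled pair.
    Only off-diagonal (negative) pairs contribute to the loss, so a mislabeled
    "positive" pair cannot be directly overfitted.
    """
    # L2-normalize embeddings so the dot product is cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T / tau                      # (n, n) scaled similarities

    # Row-wise softmax: p[i, j] = probability that text j matches image i.
    p = np.exp(sim - sim.max(axis=1, keepdims=True))
    p = p / p.sum(axis=1, keepdims=True)

    # Complementary supervision: every off-diagonal pair is a known negative,
    # so we push its matching probability toward zero.
    n = sim.shape[0]
    neg_mask = ~np.eye(n, dtype=bool)
    return float(-np.log(1.0 - p[neg_mask] + 1e-12).mean())
```

As expected under this sketch, well-aligned embeddings assign little probability to negatives and incur a small loss, while systematically mismatched pairs are heavily penalized; the actual RCL framework additionally derives an unbiased risk estimator and minimizes risk upper bounds, which this toy version omits.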
Pages: 9595-9610 (16 pages)