Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Cited by: 0
Authors
Gur, Shir [1, 3]
Neverova, Natalia [2]
Stauffer, Chris [2]
Lim, Ser-Nam [2]
Kiela, Douwe [2]
Reiter, Austin [2]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Facebook AI, Menlo Pk, CA USA
[3] FAIR, Menlo Pk, CA USA
Keywords
KNOWLEDGE; LANGUAGE;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvements in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
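The retrieval step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the hand-written two-dimensional vectors stand in for the output of a trained alignment model, and the names CaptionIndex, RetrievalAugmenter, and swap_index, as well as the "[SEP]" concatenation, are assumptions made only for this example. It shows nearest-neighbor caption lookup in a shared embedding space, and how an index might be replaced at inference time without touching the rest of the pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class CaptionIndex:
    """Toy in-memory index of (caption, embedding) pairs (hypothetical helper)."""

    def __init__(self, entries):
        self.entries = list(entries)

    def search(self, query_vec, k=1):
        # Rank captions by similarity to the query embedding, best first.
        ranked = sorted(
            self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True
        )
        return [caption for caption, _ in ranked[:k]]


class RetrievalAugmenter:
    """Appends retrieved captions to a question (hypothetical helper)."""

    def __init__(self, index):
        self.index = index

    def swap_index(self, new_index):
        # "Hot-swapping": replace the knowledge source at inference time.
        self.index = new_index

    def augment(self, question, image_vec, k=1):
        captions = self.index.search(image_vec, k)
        # Assumption: plain concatenation; a real model would fuse these inputs.
        return question + " [SEP] " + " ".join(captions)


# Hand-written stand-ins for embeddings in a shared image-caption space.
index_a = CaptionIndex([("a dog on a beach", [1.0, 0.0]),
                        ("a plate of pasta", [0.0, 1.0])])
index_b = CaptionIndex([("a puppy playing in sand", [0.9, 0.1]),
                        ("a bowl of soup", [0.1, 0.9])])

image_vec = [0.8, 0.2]
augmenter = RetrievalAugmenter(index_a)
first = augmenter.augment("What animal is shown?", image_vec)

augmenter.swap_index(index_b)
second = augmenter.augment("What animal is shown?", image_vec)

print(first)
print(second)
```

The same query embedding retrieves a different caption after the swap, which is the property the abstract refers to when it mentions changing indices at inference time.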
Pages: 111-123
Page count: 13