Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Cited by: 0
Authors
Gur, Shir [1 ,3 ]
Neverova, Natalia [2 ]
Stauffer, Chris [2 ]
Lim, Ser-Nam [2 ]
Kiela, Douwe [2 ]
Reiter, Austin [2 ]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Facebook AI, Menlo Pk, CA USA
[3] FAIR, Menlo Pk, CA USA
Keywords
KNOWLEDGE; LANGUAGE;
DOI
not available
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvements in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
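The abstract describes embedding images and captions in a shared space so that retrieval reduces to nearest-neighbor search, with indices that can be hot-swapped at inference time. The following is a minimal illustrative sketch of that retrieval step only, not the paper's trained alignment model: the `RetrievalIndex` class, the toy embeddings, and the captions are all hypothetical, and cosine similarity over pre-computed vectors stands in for the learned scoring.

```python
import numpy as np

class RetrievalIndex:
    """Minimal in-memory index over caption embeddings (illustrative only).

    Assumes images and captions already live in one shared embedding
    space, as produced by an alignment model like the one the paper trains.
    """

    def __init__(self, embeddings, captions):
        # L2-normalize rows so a dot product equals cosine similarity.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / norms
        self.captions = captions

    def retrieve(self, query_embedding, k=2):
        # Score the normalized query against every stored caption vector.
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.embeddings @ q
        top = np.argsort(-scores)[:k]
        return [(self.captions[i], float(scores[i])) for i in top]

# Toy shared-space vectors (fabricated for illustration).
captions = ["a dog on grass", "a red car", "a cat on a sofa"]
embeddings = np.array([[1.0, 0.1, 0.0],
                       [0.0, 1.0, 0.1],
                       [0.1, 0.0, 1.0]])
index = RetrievalIndex(embeddings, captions)

# An image embedding that lies near the "dog" caption vector.
query = np.array([0.9, 0.2, 0.05])
print(index.retrieve(query, k=1)[0][0])  # → a dog on grass
```

"Hot-swapping" an index at inference time then amounts to replacing the `index` object with one built over a different knowledge source, leaving the alignment model untouched.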
Pages: 111 - 123
Page count: 13
Related papers
50 items in total
  • [31] Cross-Modal Semantic Alignment and Information Refinement for Multi-Modal Sentiment Analysis
    Ding, Meirong
    Chen, Hongye
    Zeng, Biqing
    Computer Engineering and Applications, 2024, 60 (22) : 114 - 125
  • [32] Multi-modal Dictionary BERT for Cross-modal Video Search in Baidu Advertising
    Yu, Tan
    Yang, Yi
    Li, Yi
    Liu, Lin
    Sun, Mingming
    Li, Ping
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4341 - 4351
  • [33] Learning Cross-Modal Deep Representations for Multi-Modal MR Image Segmentation
    Li, Cheng
    Sun, Hui
    Liu, Zaiyi
    Wang, Meiyun
    Zheng, Hairong
    Wang, Shanshan
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2019, PT II, 2019, 11765 : 57 - 65
  • [34] Cross-modal context-gated convolution for multi-modal sentiment analysis
    Wen, Huanglu
    You, Shaodi
    Fu, Ying
    PATTERN RECOGNITION LETTERS, 2021, 146 : 252 - 259
  • [35] Multi-Modal Sarcasm Detection via Cross-Modal Graph Convolutional Network
    Liang, Bin
    Lou, Chenwei
    Li, Xiang
    Yang, Min
    Gui, Lin
    He, Yulan
    Pei, Wenjie
    Xu, Ruifeng
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 1767 - 1777
  • [36] CA_DeepSC: Cross-Modal Alignment for Multi-Modal Semantic Communications
    Wang, Wenjun
    Liu, Minghao
    Chen, Mingkai
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 5871 - 5876
  • [37] Adversarial Cross-Modal Retrieval
    Wang, Bokun
    Yang, Yang
    Xu, Xing
    Hanjalic, Alan
    Shen, Heng Tao
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 154 - 162
  • [38] Multi-Modal Medical Image Matching Based on Multi-Task Learning and Semantic-Enhanced Cross-Modal Retrieval
    Zhang, Yilin
    TRAITEMENT DU SIGNAL, 2023, 40 (05) : 2041 - 2049
  • [39] Multi-hop Interactive Cross-Modal Retrieval
    Ning, Xuecheng
    Yang, Xiaoshan
    Xu, Changsheng
    MULTIMEDIA MODELING (MMM 2020), PT II, 2020, 11962 : 681 - 693
  • [40] A semi-supervised cross-modal memory bank for cross-modal retrieval
    Huang, Yingying
    Hu, Bingliang
    Zhang, Yipeng
    Gao, Chi
    Wang, Quan
    NEUROCOMPUTING, 2024, 579