Image captioning for effective use of language models in knowledge-based visual question answering

被引:24
|
作者
Salaberria, Ander [1 ]
Azkune, Gorka [1 ]
Lacalle, Oier Lopez de [1 ]
Soroa, Aitor [1 ]
Agirre, Eneko [1 ]
机构
[1] Univ Basque Country UPV EHU, HiTZ Basque Ctr Language Technol, Ixa NLP Grp, M Lardizabal 1, Donostia San Sebastian 20018, Basque Country, Spain
关键词
Visual question answering; Image captioning; Language models; Deep learning;
D O I
10.1016/j.eswa.2022.118669
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models of comparable number of parameters; (ii) confirmation that our text-only method is specially effective for tasks requiring external knowledge, as it is less effective in standard a VQA task (VQA 2.0); and (iii) our method attains results in the state-of-the-art when increasing the size of the language model. We also significantly outperform current multimodal systems, even though augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
引用
收藏
页数:10
相关论文
共 50 条
  • [1] Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
    Wu, Qi
    Shen, Chunhua
    Wang, Peng
    Dick, Anthony
    van den Hengel, Anton
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (06) : 1367 - 1381
  • [2] Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering
    Hu, Zhongjian
    Yang, Peng
    Liu, Fengyuan
    Meng, Yuan
    Liu, Xingyu
    BIG DATA MINING AND ANALYTICS, 2024, 7 (03): : 843 - 857
  • [3] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
    Shao, Zhenwei
    Yu, Zhou
    Wang, Meng
    Yu, Jun
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14974 - 14983
  • [4] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [5] Image captioning improved visual question answering
    Himanshu Sharma
    Anand Singh Jalal
    Multimedia Tools and Applications, 2022, 81 : 34775 - 34796
  • [6] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [7] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun
    Dong, Hongsong
    Wu, Guangsheng
    IEEE ACCESS, 2025, 13 : 46299 - 46311
  • [8] FacGPT: An Effective and Efficient Method for Evaluating Knowledge-Based Visual Question Answering
    Cheng, Sirui
    Zhang, Siyu
    Wu, Jiayi
    Lan, Muchen
    Sun, Yaoru
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT I, NLPCC 2024, 2025, 15359 : 201 - 214
  • [9] Explicit Knowledge-based Reasoning for Visual Question Answering
    Wang, Peng
    Wu, Qi
    Shen, Chunhua
    Dick, Anthony
    van den Hengel, Anton
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1290 - 1296
  • [10] A Simple Baseline for Knowledge-Based Visual Question Answering
    Xenos, Alexandros
    Stafylakis, Themos
    Patras, Ioannis
    Tzimiropoulos, Georgios
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 14871 - 14877