Image captioning for effective use of language models in knowledge-based visual question answering

Cited by: 24
Authors
Salaberria, Ander [1]
Azkune, Gorka [1]
Lacalle, Oier Lopez de [1]
Soroa, Aitor [1]
Agirre, Eneko [1]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Basque Ctr Language Technol, Ixa NLP Grp, M Lardizabal 1, Donostia San Sebastian 20018, Basque Country, Spain
Keywords
Visual question answering; Image captioning; Language models; Deep learning;
DOI
10.1016/j.eswa.2022.118669
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective in a standard VQA task (VQA 2.0); and (iii) a demonstration that our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when those systems are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
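The text-only procedure described in the abstract (caption the image, then pass caption and question to a language model) can be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the prompt layout (`caption: ... question: ...`), the function names, and the toy captioner and language model are all hypothetical stand-ins for an off-the-shelf captioning system and a pretrained language model.

```python
from typing import Any, Callable


def build_text_only_input(caption: str, question: str) -> str:
    """Join the verbalized image and the question into one text input.

    The 'caption: ... question: ...' layout is an assumed format,
    not one taken from the paper.
    """
    return f"caption: {caption.strip()} question: {question.strip()}"


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any], str],
    language_model: Callable[[str], str],
) -> str:
    """Answer a visual question using text only.

    The image is seen only by the captioner; the language model
    receives nothing but the caption and the question.
    """
    caption = captioner(image)
    return language_model(build_text_only_input(caption, question))


# Hypothetical stand-ins so the sketch runs without any model downloads.
def toy_captioner(image: Any) -> str:
    return "a man riding a wave on a surfboard"


def toy_language_model(text: str) -> str:
    return "surfing" if "surfboard" in text else "unknown"


if __name__ == "__main__":
    print(answer_question(object(), "What sport is this?",
                          toy_captioner, toy_language_model))
```

Passing the captioner and the language model as plain callables keeps the sketch independent of any particular library; a real captioning system or language model could be wrapped to match these signatures.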
Pages: 10
Related papers
50 in total
  • [41] Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering
    Wang, Yan
    Li, Peize
    Si, Qingyi
    Zhang, Hanwen
    Zang, Wenyu
    Lin, Zheng
    Fu, Peng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (03)
  • [42] Inner Knowledge-based Img2Doc Scheme for Visual Question Answering
    Li, Qun
    Xiao, Fu
    Bhanu, Bir
    Sheng, Biyun
    Hong, Richang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (03)
  • [43] Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
    Luo, Man
    Zeng, Yankai
    Banerjee, Pratyay
    Baral, Chitta
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6417 - 6431
  • [44] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    ELECTRONICS, 2023, 12 (06)
  • [45] Asking Clarification Questions in Knowledge-Based Question Answering
    Xu, Jingjing
    Wang, Yuechen
    Tang, Duyu
    Duan, Nan
    Yang, Pengcheng
    Zeng, Qi
    Zhou, Ming
    Sun, Xu
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1618 - 1629
  • [46] Direct relation detection for knowledge-based question answering
    Shamsabadi, Abbas Shahini
    Ramezani, Reza
    Farsani, Hadi Khosravi
    Nematbakhsh, Mohammadali
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 211
  • [47] Knowledge-Based Approach to Question Answering System Selection
    Konys, Agnieszka
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2015), PT I, 2015, 9329 : 361 - 370
  • [48] Knowledge-Based Visual Question Generation
    Xie, Jiayuan
    Fang, Wenhao
    Cai, Yi
    Huang, Qingbao
    Li, Qing
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (11) : 7547 - 7558
  • [49] The Core of Smart Cities: Knowledge Representation and Descriptive Framework Construction in Knowledge-Based Visual Question Answering
    Wang, Ruiping
    Wu, Shihong
    Wang, Xiaoping
    SUSTAINABILITY, 2022, 14 (20)
  • [50] Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering
    Dong, Xuanyi
    Zhu, Linchao
    Zhang, De
    Yang, Yi
    Wu, Fei
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 54 - 62