Image captioning for effective use of language models in knowledge-based visual question answering

Cited by: 24
Authors
Salaberria, Ander [1]
Azkune, Gorka [1]
Lacalle, Oier Lopez de [1]
Soroa, Aitor [1]
Agirre, Eneko [1]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Basque Ctr Language Technol, Ixa NLP Grp, M Lardizabal 1, Donostia San Sebastian 20018, Basque Country, Spain
Keywords
Visual question answering; Image captioning; Language models; Deep learning
DOI
10.1016/j.eswa.2022.118669
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective in a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be offset by the better inference ability of text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
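As a rough illustration of the caption-then-answer idea described in the abstract, the sketch below chains an off-the-shelf image captioner with a text-only language model. The Hugging Face pipelines and model names (nlpconnect/vit-gpt2-image-captioning, google/flan-t5-base) are illustrative assumptions only, not the captioner or language model used by the authors.

```python
# Minimal sketch of a caption-then-answer pipeline: verbalize the image with an
# off-the-shelf captioner, then let a text-only language model answer the
# question from the caption alone, drawing on its implicit world knowledge.
# Model choices are illustrative, not the authors' exact setup.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
reader = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_from_caption(image_path: str, question: str) -> str:
    # 1) Verbalize the image contents as a caption.
    caption = captioner(image_path)[0]["generated_text"]
    # 2) Answer from the caption plus the question using a text-only model.
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return reader(prompt, max_new_tokens=10)[0]["generated_text"]

# Example of an OK-VQA-style question that requires outside knowledge:
# answer_from_caption("kitchen.jpg", "What appliance is used to keep food cold?")
```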
Pages: 10
Related papers
50 items in total
  • [31] Explainable Knowledge reasoning via thought chains for knowledge-based visual question answering
    Qiu, Chen
    Xie, Zhiqiang
    Liu, Maofu
    Hu, Huijun
    INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (04)
  • [32] Knowledge-Based Question and Answering System for Turkish
    Yasar, Pinar
    Sahin, Irem
    Adali, Esref
    2019 4TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2019, : 307 - 312
  • [33] Knowledge-Based Question Answering as Machine Translation
    Bao, Junwei
    Duan, Nan
    Zhou, Ming
    Zhao, Tiejun
    PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1, 2014, : 967 - 976
  • [34] Medical knowledge-based network for Patient-oriented Visual Question Answering
    Jian, Huang
    Chen, Yihao
    Yong, Li
    Yang, Zhenguo
    Gong, Xuehao
    Lee, Wang Fu
    Xu, Xiaohong
    Liu, Wenyin
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)
  • [35] ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    Le Borgne, Herve
    Besancon, Romaric
    Moreno, Jose G.
    Melgarejo, Jesus Lovon
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 3108 - 3120
  • [36] Effective Search of Logical Forms for Weakly Supervised Knowledge-Based Question Answering
    Shen, Tao
    Geng, Xiubo
    Long, Guodong
    Jiang, Jing
    Zhang, Chengqi
    Jiang, Daxin
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 2227 - 2233
  • [37] A Retriever-Reader Framework with Visual Entity Linking for Knowledge-Based Visual Question Answering
    You, Jiuxiang
    Yang, Zhenguo
    Li, Qing
    Liu, Wenyin
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 13 - 18
  • [38] MKEAH: Multimodal knowledge extraction and accumulation based on hyperplane embedding for knowledge-based visual question answering
    Zhang, Heng
    Wei, Zhihua
    Liu, Guanming
    Wang, Rui
    Mu, Ruibin
    Liu, Chuanbao
    Yuan, Aiquan
    Cao, Guodong
    Hu, Ning
    VIRTUAL REALITY AND INTELLIGENT HARDWARE, 2024, 6 (04): 280 - 291
  • [40] Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection
    Garcia-Olano, Diego
    Onoe, Yasumasa
    Ghosh, Joydeep
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 705 - 715