Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language

Cited by: 0
Authors
Salewski, Leonard [1 ]
Koepke, A. Sophia [1 ]
Lensch, Hendrik P. A. [1 ]
Akata, Zeynep [1 ,2 ]
Affiliations
[1] Univ Tubingen, Tubingen, Germany
[2] MPI Intelligent Syst, Tubingen, Germany
Keywords
Zero-Shot Translation of Attention Patterns; VQA;
DOI
10.1007/978-3-031-54605-1_25
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer as inputs. The LLM is guided to select tokens that describe the regions in the input image that the VQA model attended to. Crucially, this guidance is derived from the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g. attribution instead of attention maps) or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X. Our code is available here.
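
The guided decoding the abstract describes can be made concrete with a short sketch. The snippet below is an illustrative approximation, not the authors' released implementation: it assumes a GPT-2 causal LM from Hugging Face transformers as the LLM, and it substitutes CLIP's image-text matching score for the VQA backbone's matching head (which ZS-A2T actually reuses); it also scores candidates against the whole image rather than only the regions the VQA model attended to.

```python
# Sketch of LLM decoding guided by an image-text matching score, in the
# spirit of ZS-A2T. Model choices (GPT-2, CLIP) are illustrative stand-ins.
import torch
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPModel, CLIPProcessor)

lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def match_score(image: Image.Image, text: str) -> float:
    # Stand-in for the VQA model's image-text matching head; ZS-A2T reuses
    # the VQA backbone itself, CLIP just keeps this sketch self-contained.
    inputs = clip_proc(text=[text], images=image,
                       return_tensors="pt", truncation=True)
    return clip(**inputs).logits_per_image.item()

@torch.no_grad()
def guided_decode(image: Image.Image, prompt: str,
                  max_new_tokens: int = 25, top_k: int = 10) -> str:
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        next_logits = lm(ids).logits[0, -1]
        # The LLM proposes its top-k next tokens ...
        candidates = torch.topk(next_logits, top_k).indices.tolist()
        # ... which are re-ranked by how well the extended text matches
        # the image content (ZS-A2T restricts this to attended regions).
        best = max(candidates, key=lambda t: match_score(
            image, lm_tok.decode(ids[0].tolist() + [t])))
        ids = torch.cat([ids, torch.tensor([[best]])], dim=1)
        if best == lm_tok.eos_token_id:
            break
    return lm_tok.decode(ids[0], skip_special_tokens=True)
```

Under these assumptions, a call such as guided_decode(Image.open("kite.jpg"), "Question: What is the man holding? Answer: a kite. He is holding") would produce a visually grounded continuation; the point of ZS-A2T is that both the matching signal and the attended regions come from the same VQA model being explained, so no extra training is needed.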
Pages: 378-393 (16 pages)
Related Papers (showing the first 10 of 50)
  • [1] Modularized Zero-shot VQA with Pre-trained Models
    Cao, Rui
    Jiang, Jing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 58 - 76
  • [2] Exploring Question Decomposition for Zero-Shot VQA
    Khan, Zaid
    Kumar, Vijay B. G.
    Schulter, Samuel
    Chandraker, Manmohan
    Fu, Yun
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [3] JOINT MUSIC AND LANGUAGE ATTENTION MODELS FOR ZERO-SHOT MUSIC TAGGING
    Du, Xingjian
    Yu, Zhesong
    Lin, Jiaju
    Zhu, Bilei
    Kong, Qiuqiang
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 1126 - 1130
  • [4] Zero-shot Natural Language Video Localization
    Nam, Jinwoo
    Ahn, Daechul
    Kang, Dongyeop
    Ha, Seong Jong
    Choi, Jonghyun
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1450 - 1459
  • [5] Large Language Models are Zero-Shot Reasoners
    Kojima, Takeshi
    Gu, Shixiang Shane
    Reid, Machel
    Matsuo, Yutaka
    Iwasawa, Yusuke
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [6] Language Models as Zero-Shot Trajectory Generators
    Kwon, Teyun
    Di Palo, Norman
    Johns, Edward
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (07): : 6728 - 6735
  • [7] Improving Zero-shot Translation with Language-Independent Constraints
    Pham, Ngoc-Quan
    Niehues, Jan
    Ha, Thanh-Le
    Waibel, Alex
    FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 1: RESEARCH PAPERS, 2019, : 13 - 23
  • [8] Language Tags Matter for Zero-Shot Neural Machine Translation
    Wu, Liwei
    Cheng, Shanbo
    Wang, Mingxuan
    Li, Lei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 3001 - 3007
  • [9] Large Language Models as Zero-Shot Conversational Recommenders
    He, Zhankui
    Xie, Zhouhang
    Jha, Rahul
    Steck, Harald
    Liang, Dawen
    Feng, Yesu
    Majumder, Bodhisattwa Prasad
    Kallus, Nathan
    McAuley, Julian
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 720 - 730
  • [10] Zero-Shot Classification of Art With Large Language Models
    Tojima, Tatsuya
    Yoshida, Mitsuo
    IEEE ACCESS, 2025, 13 : 17426 - 17439