3VL: Using Trees to Improve Vision-Language Models' Interpretability

Cited by: 0
Authors
Yellinek, Nir [1 ]
Karlinsky, Leonid [2 ]
Giryes, Raja [1 ]
Affiliations
[1] Tel Aviv Univ, Iby & Aladar Fleischman Fac Engn, Sch Elect Engn, IL-69978 Tel Aviv, Israel
[2] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Keywords
Random forests; Visualization; Training; Cognition; Feature extraction; Transformers; Forestry; Animals; Analytical models; Semantics; Convolutional neural networks; Vision-Language models (VLMs); explainable AI; compositional reasoning
DOI
10.1109/TIP.2024.3523801
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows this structure to be induced into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
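The abstract describes expanding an image caption into a hierarchical tree using language analysis tools, but does not specify the implementation. The following is a minimal illustrative sketch of such a caption-to-tree expansion, assuming spaCy as the analysis tool; the function name caption_tree and the two-level hierarchy (full caption, then noun phrases with their head nouns) are assumptions made for illustration, not the authors' actual method.

# Illustrative sketch only: expand a caption into a coarse-to-fine hierarchy
# (full caption -> noun phrases -> head nouns) using spaCy noun chunks.
# This is an assumption-based example, not the 3VL implementation.
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_tree(caption: str) -> dict:
    """Return a simple hierarchy: caption -> noun phrases -> head nouns."""
    doc = nlp(caption)
    return {
        "caption": caption,
        "phrases": [
            {"text": chunk.text, "head": chunk.root.text}
            for chunk in doc.noun_chunks
        ],
    }

# Example: a compositional caption with attributes and relations.
print(caption_tree("a brown dog chasing a red ball on the grass"))
# Expected output (may vary slightly with the spaCy model):
# {'caption': '...', 'phrases': [{'text': 'a brown dog', 'head': 'dog'},
#  {'text': 'a red ball', 'head': 'ball'}, {'text': 'the grass', 'head': 'grass'}]}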
Pages: 495-509 (15 pages)