3VL: Using Trees to Improve Vision-Language Models' Interpretability

Cited by: 0
Authors
Yellinek, Nir [1 ]
Karlinsky, Leonid [2 ]
Giryes, Raja [1 ]
Affiliations
[1] Tel Aviv Univ, Iby & Aladar Fleischman Fac Engn, Sch Elect Engn, IL-69978 Tel Aviv, Israel
[2] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Keywords
Random forests; Visualization; Training; Cognition; Feature extraction; Transformers; Forestry; Animals; Analytical models; Semantics; Convolutional neural networks; Visual Language models (VLMs); explainable AI; compositional reasoning
DOI
10.1109/TIP.2024.3523801
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows this structure to be induced into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
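As a rough illustration of the text-expansion step described in the abstract, the sketch below (not the authors' released code) shows how a caption could be decomposed into a coarse-to-fine hierarchy with an off-the-shelf language analysis tool. It assumes spaCy with the en_core_web_sm model is available; the function name caption_tree and the particular three-level split are illustrative choices only.

import spacy

# Assumed dependency: spaCy with the small English model installed
# (python -m spacy download en_core_web_sm). Any parser exposing noun chunks works.
nlp = spacy.load("en_core_web_sm")

def caption_tree(caption: str) -> list:
    """Decompose a caption into coarse-to-fine levels:
    level 1: bare head nouns (objects), level 2: full noun phrases
    (objects with attributes), level 3: the complete caption (relations)."""
    doc = nlp(caption)
    objects = [chunk.root.text for chunk in doc.noun_chunks]  # e.g. "dog", "couch"
    phrases = [chunk.text for chunk in doc.noun_chunks]       # e.g. "a small brown dog"
    return [objects, phrases, [caption]]

print(caption_tree("a small brown dog sitting on a red couch"))
# roughly: [['dog', 'couch'], ['a small brown dog', 'a red couch'],
#           ['a small brown dog sitting on a red couch']]

The full 3VL pipeline goes further than this sketch: the resulting tree is induced into the model's visual representation during training, and it is combined with the Anchor inference step and the DiRe relevancy-map comparison described above.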
Pages: 495-509
Number of pages: 15