3VL: Using Trees to Improve Vision-Language Models' Interpretability

Cited by: 0

Authors:

Yellinek, Nir [1]

Karlinsky, Leonid [2]

Giryes, Raja [1]

Affiliations:

[1] Tel Aviv Univ, Iby & Aladar Fleischman Fac Engn, Sch Elect Engn, IL-69978 Tel Aviv, Israel

[2] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2025 / Vol. 34

Keywords:

Random forests; Visualization; Training; Cognition; Feature extraction; Transformers; Forestry; Animals; Analytical models; Semantics; Convolutional neural networks; Visual Language models (VLMs); explainable AI; compositional reasoning;

DOI:

10.1109/TIP.2024.3523801

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.

Pages: 495-509

Page count: 15
