3VL: Using Trees to Improve Vision-Language Models' Interpretability

Cited by: 0
Authors
Yellinek, Nir [1 ]
Karlinsky, Leonid [2 ]
Giryes, Raja [1 ]
Affiliations
[1] Tel Aviv Univ, Iby & Aladar Fleischman Fac Engn, Sch Elect Engn, IL-69978 Tel Aviv, Israel
[2] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Keywords
Random forests; Visualization; Training; Cognition; Feature extraction; Transformers; Forestry; Animals; Analytical models; Semantics; Convolutional neural networks; Visual Language models (VLMs); explainable AI; compositional reasoning
DOI: 10.1109/TIP.2024.3523801
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter out nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
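The DiRe idea described in the abstract, a differential comparison between VLM relevancy maps for a correct versus an incorrect caption, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the two relevancy maps are already computed as same-shape 2D arrays over image patches, and the function name, clipping, and normalization are our own assumptions.

```python
import numpy as np

def differential_relevance(rel_pos: np.ndarray, rel_neg: np.ndarray) -> np.ndarray:
    """Toy sketch of a DiRe-style differential comparison (our assumption,
    not the authors' code).

    rel_pos: relevancy map for the correct (positive) caption
    rel_neg: relevancy map for the incorrect (negative) caption
    Returns a map highlighting regions more relevant to the positive
    caption than to the negative one (negative differences clipped to 0,
    then normalized to sum to 1 for visualization).
    """
    diff = np.clip(rel_pos - rel_neg, 0.0, None)  # keep only positive evidence
    total = diff.sum()
    return diff / total if total > 0 else diff

# Hypothetical 2x2 relevancy maps over image patches
pos = np.array([[0.8, 0.1], [0.05, 0.05]])
neg = np.array([[0.2, 0.4], [0.20, 0.20]])
dire = differential_relevance(pos, neg)
```

In this toy example only the top-left patch is more relevant to the positive caption, so the resulting map concentrates all its mass there; a visualization of such a map would highlight the image regions responsible for the model preferring one caption over the other.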
Pages: 495-509 (15 pages)
Related Papers (showing 10 of 50)
  • [21] Vatsa, Mayank; Jain, Anubhooti; Singh, Richa. Adventures of Trustworthy Vision-Language Models: A Survey. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38 No. 20, 2024: 22650-22658.
  • [22] Wang, Tan; Lin, Kevin; Li, Linjie; Lin, Chung-Ching; Yang, Zhengyuan; Zhang, Hanwang; Liu, Zicheng; Wang, Lijuan. Equivariant Similarity for Vision-Language Foundation Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 11964-11974.
  • [23] Aflalo, Estelle; Du, Meng; Tseng, Shao-Yen; Liu, Yongfei; Wu, Chenfei; Duan, Nan; Lal, Vasudev. VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 21374-21383.
  • [24] Cao, Yun-Hao; Ji, Kaixiang; Huang, Ziyuan; Zheng, Chuanyang; Liu, Jiajia; Wang, Jian; Chen, Jingdong; Yang, Ming. Towards Better Vision-Inspired Vision-Language Models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 13537-13547.
  • [25] Nirala, Ashutosh; Joshi, Ameya; Sarkar, Soumik; Hegde, Chinmay. Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing. IEEE Conference on Safe and Trustworthy Machine Learning (SaTML 2024), 2024: 252-271.
  • [26] Gu, Zhaopeng; Zhu, Bingke; Zhu, Guibo; Chen, Yingying; Tang, Ming; Wang, Jinqiao. AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38 No. 3, 2024: 1932-1940.
  • [27] Huang, Ian; Yang, Guandao; Guibas, Leonidas. BlenderAlchemy: Editing 3D Graphics with Vision-Language Models. Computer Vision - ECCV 2024, Pt. LXXXIX, 2025, 15147: 297-314.
  • [28] Zhang, Xinsong; Zeng, Yan; Zhang, Jipeng; Li, Hang. Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks. Findings of the Association for Computational Linguistics - EMNLP 2023, 2023: 551-568.
  • [29] Zhang, Pengchuan; Li, Xiujun; Hu, Xiaowei; Yang, Jianwei; Zhang, Lei; Wang, Lijuan; Choi, Yejin; Gao, Jianfeng. VinVL: Revisiting Visual Representations in Vision-Language Models. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021: 5575-5584.
  • [30] Zhang, Haiwen; Yang, Zixi; Liu, Yuanzhi; Wang, Xinran; He, Zheqi; Liang, Kongming; Ma, Zhanyu. Evaluating Attribute Comprehension in Large Vision-Language Models. Pattern Recognition and Computer Vision (PRCV 2024), Pt. V, 2025, 15035: 98-113.