3VL: Using Trees to Improve Vision-Language Models' Interpretability

Cited by: 0
Authors
Yellinek, Nir [1 ]
Karlinsky, Leonid [2 ]
Giryes, Raja [1 ]
Affiliations
[1] Tel Aviv Univ, Iby & Aladar Fleischman Fac Engn, Sch Elect Engn, IL-69978 Tel Aviv, Israel
[2] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Keywords
Random forests; Visualization; Training; Cognition; Feature extraction; Transformers; Forestry; Animals; Analytical models; Semantics; Convolutional neural networks; Visual Language models (VLMs); explainable AI; compositional reasoning
DOI: 10.1109/TIP.2024.3523801
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter out nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
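The DiRe idea described in the abstract, a differential comparison between VLM relevancy maps for a correct versus an incorrect caption, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the two relevancy maps are already computed as same-shape 2D arrays over image patches, and the function name, clipping, and normalization are our own assumptions.

```python
import numpy as np

def differential_relevance(rel_pos: np.ndarray, rel_neg: np.ndarray) -> np.ndarray:
    """Toy sketch of a DiRe-style differential comparison (our assumption,
    not the authors' code).

    rel_pos: relevancy map for the correct (positive) caption
    rel_neg: relevancy map for the incorrect (negative) caption
    Returns a map highlighting regions more relevant to the positive
    caption than to the negative one (negative differences clipped to 0,
    then normalized to sum to 1 for visualization).
    """
    diff = np.clip(rel_pos - rel_neg, 0.0, None)  # keep only positive evidence
    total = diff.sum()
    return diff / total if total > 0 else diff

# Hypothetical 2x2 relevancy maps over image patches
pos = np.array([[0.8, 0.1], [0.05, 0.05]])
neg = np.array([[0.2, 0.4], [0.20, 0.20]])
dire = differential_relevance(pos, neg)
```

In this toy example only the top-left patch is more relevant to the positive caption, so the resulting map concentrates all its mass there; a visualization of such a map would highlight the image regions responsible for the model preferring one caption over the other.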
Pages: 495-509 (15 pages)
Related Papers (showing 10 of 50)
  • [21] Vatsa, Mayank; Jain, Anubhooti; Singh, Richa. Adventures of Trustworthy Vision-Language Models: A Survey. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38 No. 20, 2024: 22650-22658.
  • [22] Wang, Tan; Lin, Kevin; Li, Linjie; Lin, Chung-Ching; Yang, Zhengyuan; Zhang, Hanwang; Liu, Zicheng; Wang, Lijuan. Equivariant Similarity for Vision-Language Foundation Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 11964-11974.
  • [23] Aflalo, Estelle; Du, Meng; Tseng, Shao-Yen; Liu, Yongfei; Wu, Chenfei; Duan, Nan; Lal, Vasudev. VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 21374-21383.
  • [24] Cao, Yun-Hao; Ji, Kaixiang; Huang, Ziyuan; Zheng, Chuanyang; Liu, Jiajia; Wang, Jian; Chen, Jingdong; Yang, Ming. Towards Better Vision-Inspired Vision-Language Models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 13537-13547.
  • [25] Nirala, Ashutosh; Joshi, Ameya; Sarkar, Soumik; Hegde, Chinmay. Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing. IEEE Conference on Safe and Trustworthy Machine Learning (SaTML 2024), 2024: 252-271.
  • [26] Gu, Zhaopeng; Zhu, Bingke; Zhu, Guibo; Chen, Yingying; Tang, Ming; Wang, Jinqiao. AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38 No. 3, 2024: 1932-1940.
  • [27] Huang, Ian; Yang, Guandao; Guibas, Leonidas. BlenderAlchemy: Editing 3D Graphics with Vision-Language Models. Computer Vision - ECCV 2024, Pt. LXXXIX, 2025, 15147: 297-314.
  • [28] Zhang, Xinsong; Zeng, Yan; Zhang, Jipeng; Li, Hang. Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks. Findings of the Association for Computational Linguistics - EMNLP 2023, 2023: 551-568.
  • [29] Zhang, Pengchuan; Li, Xiujun; Hu, Xiaowei; Yang, Jianwei; Zhang, Lei; Wang, Lijuan; Choi, Yejin; Gao, Jianfeng. VinVL: Revisiting Visual Representations in Vision-Language Models. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021: 5575-5584.
  • [30] Zhang, Haiwen; Yang, Zixi; Liu, Yuanzhi; Wang, Xinran; He, Zheqi; Liang, Kongming; Ma, Zhanyu. Evaluating Attribute Comprehension in Large Vision-Language Models. Pattern Recognition and Computer Vision (PRCV 2024), Pt. V, 2025, 15035: 98-113.