Woodpecker: hallucination correction for multimodal large language models

Cited: 0
Authors
Yin, Shukang [1 ]
Fu, Chaoyou [2 ,3 ]
Zhao, Sirui [1 ]
Xu, Tong [1 ]
Wang, Hao [1 ]
Sui, Dianbo [4 ]
Shen, Yunhang [5 ]
Li, Ke [5 ]
Sun, Xing [5 ]
Chen, Enhong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Artificial Intelligence & Data Sci, Hefei 230026, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou 215163, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[5] YouTu, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multimodal learning; multimodal large language models; hallucination correction; large language models; vision and language;
DOI
10.1007/s11432-024-4251-x
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Hallucination, the phenomenon in which generated text is inconsistent with the image content, casts a long shadow over the rapidly evolving multimodal large language models (MLLMs). To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the great potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baselines MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
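The abstract's five-stage pipeline can be sketched in code. The sketch below is purely illustrative: the real Woodpecker drives each stage with an LLM and visual grounding models, whereas here every stage is a toy rule-based stub (and the image is a hand-picked set of ground-truth objects) so that the control flow is runnable end to end. All function names and the `IMAGE_OBJECTS` set are assumptions, not the project's actual API.

```python
# Hypothetical sketch of Woodpecker's five-stage post-remedy pipeline.
# Each stage is a toy stub; the real system uses LLMs and detectors.

IMAGE_OBJECTS = {"dog", "frisbee", "grass"}  # stand-in for visual ground truth


def extract_key_concepts(answer):
    """Stage 1: key concept extraction (stub: match against a small vocabulary)."""
    vocab = {"dog", "cat", "frisbee", "grass", "ball"}
    return [w.strip(".,").lower() for w in answer.split()
            if w.strip(".,").lower() in vocab]


def formulate_questions(concepts):
    """Stage 2: turn each concept into a verification question."""
    return [f"Is there a {c} in the image?" for c in concepts]


def validate_visual_knowledge(concepts):
    """Stage 3: answer the questions against the image (stub detector)."""
    return {c: (c in IMAGE_OBJECTS) for c in concepts}


def generate_visual_claims(evidence):
    """Stage 4: distill the visual evidence into grounded claims."""
    return [f"There is {'a' if present else 'no'} {c} in the image."
            for c, present in evidence.items()]


def correct_hallucinations(answer, evidence):
    """Stage 5: rewrite the answer, dropping unsupported object mentions."""
    kept = [w for w in answer.split()
            if w.strip(".,").lower() not in evidence
            or evidence[w.strip(".,").lower()]]
    return " ".join(kept)


def woodpecker(answer):
    """Run the five stages in order; intermediate outputs stay inspectable."""
    concepts = extract_key_concepts(answer)
    questions = formulate_questions(concepts)        # interpretable output
    evidence = validate_visual_knowledge(concepts)
    claims = generate_visual_claims(evidence)        # interpretable output
    return correct_hallucinations(answer, evidence)
```

For example, `woodpecker("A dog chases a cat and a frisbee on the grass.")` drops the unsupported "cat" mention while keeping the grounded "dog", "frisbee", and "grass"; in the actual method the final rewrite is done by an LLM, so the corrected sentence stays fluent rather than having words crudely deleted.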
Pages: 13
Related papers
50 records in total
  • [21] Evolution and Prospects of Foundation Models: From Large Language Models to Large Multimodal Models
    Chen, Zheyi
    Xu, Liuchang
    Zheng, Hongting
    Chen, Luyao
    Tolba, Amr
    Zhao, Liang
    Yu, Keping
    Feng, Hailin
    CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 80 (02): : 1753 - 1808
  • [22] Large language models and multimodal foundation models for precision oncology
    Truhn, Daniel
    Eckardt, Jan-Niklas
    Ferber, Dyke
    Kather, Jakob Nikolas
    NPJ PRECISION ONCOLOGY, 2024, 8 (01)
  • [24] Object Hallucination Detection in Large Vision Language Models via Evidential Conflict
    Liu, Zhekun
    Huang, Tao
    Wang, Rui
    Jing, Liping
    BELIEF FUNCTIONS: THEORY AND APPLICATIONS, BELIEF 2024, 2024, 14909 : 58 - 67
  • [25] Towards Mitigating Hallucination in Large Language Models via Self-Reflection
    Ji, Ziwei
    Yu, Tiezheng
    Xu, Yan
    Lee, Nayeon
    Ishii, Etsuko
    Fung, Pascale
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 1827 - 1843
  • [26] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
    Huang, Lei
    Yu, Weijiang
    Ma, Weitao
    Zhong, Weihong
    Feng, Zhangyin
    Wang, Haotian
    Chen, Qianglong
    Peng, Weihua
    Feng, Xiaocheng
    Qin, Bing
    Liu, Ting
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2025, 43 (02)
  • [27] Can We Edit Multimodal Large Language Models?
    Cheng, Siyuan
    Tian, Bozhong
    Liu, Qingbin
    Chen, Xi
    Wang, Yongheng
    Chen, Huajun
    Zhang, Ningyu
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 13877 - 13888
  • [28] Contextual Object Detection with Multimodal Large Language Models
    Zang, Yuhang
    Li, Wei
    Han, Jun
    Zhou, Kaiyang
    Loy, Chen Change
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133 (02) : 825 - 843
  • [29] Investigating the Catastrophic Forgetting in Multimodal Large Language Models
    Zhai, Yuexiang
    Tong, Shengbang
    Li, Xiao
    Cai, Mu
    Qu, Qing
    Lee, Yong Jae
    Ma, Yi
    CONFERENCE ON PARSIMONY AND LEARNING, VOL 234, 2024, 234 : 202 - 227
  • [30] Multimodal Food Image Classification with Large Language Models
    Kim, Jun-Hwa
    Kim, Nam-Ho
    Jo, Donghyeok
    Won, Chee Sun
    ELECTRONICS, 2024, 13 (22)