Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models

Cited by: 0
Authors
Li, Ningke [1 ]
Li, Yuekang [2 ]
Liu, Yi [3 ]
Shi, Ling [3 ]
Wang, Kailong [1 ]
Wang, Haoyu [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ New South Wales, Kensington, Australia
[3] Nanyang Technol Univ, Singapore, Singapore
Source
Funding
National Key Research and Development Program of China;
Keywords
Large Language Model; Hallucination; Software Testing;
DOI
10.1145/3689776
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Codes
081202; 0835;
Abstract
Large language models (LLMs) have revolutionized language processing, but they face critical challenges with security, privacy, and hallucinations, i.e., coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content that contradicts ground-truth facts. Addressing FCH is difficult due to two key challenges: (1) automatically constructing and updating benchmark datasets is hard, as existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases; and (2) validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations. To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection. We develop an extensive and extensible framework that constructs a comprehensive factual knowledge base by crawling sources such as Wikipedia, seamlessly integrated into DROWZEE. Using logical reasoning rules, we transform and augment this knowledge into a large set of test cases with ground-truth answers. We test LLMs on these cases through template-based prompts, requiring them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the semantic structures of the LLM answers and the ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs within nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings include that LLMs struggle with temporal concepts and out-of-distribution knowledge and lack logical reasoning capabilities. The results show that logic-based test cases generated by DROWZEE effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective on a small scale (with edits to fewer than 1000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations.
Pages: 30
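
Illustrative sketch. The abstract describes deriving new test cases from a factual knowledge base via logical reasoning rules and checking LLM answers against ground truth with semantic-aware oracles. The minimal Python sketch below is not DROWZEE's implementation; it only illustrates the general flavor under simplifying assumptions: a toy transitive rule derives a new fact from seed triples, a template turns it into a prompt with a known answer, and a naive keyword oracle stands in for the paper's semantic-structure comparison. All names (Triple, transitive_rule, build_prompt, simple_oracle) are hypothetical.

# Toy illustration of logic-rule-based test-case derivation and oracle checking.
# This is a sketch of the idea only, not the DROWZEE tool.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

# Seed facts, e.g. crawled from a source such as Wikipedia.
SEED_FACTS = [
    Triple("Alan Turing", "born_in", "London"),
    Triple("London", "located_in", "England"),
]

def transitive_rule(facts):
    """Toy rule: born_in(x, y) & located_in(y, z) => born_in_region(x, z)."""
    derived = set()
    for a, b in product(facts, facts):
        if a.relation == "born_in" and b.relation == "located_in" and a.obj == b.subject:
            derived.add(Triple(a.subject, "born_in_region", b.obj))
    return derived

def build_prompt(triple):
    """Turn a derived fact into a question with a known ground-truth answer."""
    prompt = (f"Was {triple.subject} born in {triple.obj}? "
              "Answer yes or no and explain your reasoning.")
    return prompt, "yes"

def simple_oracle(llm_answer, ground_truth):
    """Naive stand-in for a semantic-aware oracle: check the expected polarity
    appears in the answer; the paper's oracles compare semantic structures."""
    return ground_truth in llm_answer.lower()

if __name__ == "__main__":
    for fact in transitive_rule(SEED_FACTS):
        prompt, truth = build_prompt(fact)
        print("PROMPT:", prompt)
        # A real harness would query an LLM here; we fake a response.
        fake_llm_answer = "Yes, Alan Turing was born in London, which is in England."
        print("HALLUCINATION DETECTED:", not simple_oracle(fake_llm_answer, truth))

In the actual approach described by the abstract, the derivation step is driven by logic programming over a large crawled knowledge base, and the oracle compares the semantic structure of the LLM's reasoned answer against the ground truth rather than matching keywords.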