Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited by: 0

Authors:

Xu, Fangzhi [1]

Lin, Qika [1]

Han, Jiawei [1]

Zhao, Tianzhe [1]

Liu, Jun [2]

Cambria, Erik [3]

Affiliations:

[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China

[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China

[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2025 / Vol. 37 / No. 04

Funding:

National Natural Science Foundation of China;

Keywords:

Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;

DOI:

10.1109/TKDE.2025.3536008

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, the question of whether LLMs can effectively address the task of logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains unanswered. To this end, we aim to bridge this gap and provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. Considering the comprehensiveness of evaluations, we include 3 early-era representative LLMs and 4 trending LLMs. Second, different from previous evaluations relying only on simple metrics (e.g., accuracy), we propose fine-level evaluations in objective and subjective manners, covering both answers and explanations, including answer correctness, explain correctness, explain completeness and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases will be attributed to five error types from two dimensions, i.e., evidence selection process and reasoning process. Third, to avoid the influences of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on the in-depth evaluations, this paper finally forms a general evaluation scheme of logical reasoning capability from six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future works.


Pages: 1620-1634

Page count: 15
