Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited: 0
Authors
Xu, Fangzhi [1 ]
Lin, Qika [1 ]
Han, Jiawei [1 ]
Zhao, Tianzhe [1 ]
Liu, Jun [2 ]
Cambria, Erik [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China
Keywords
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning
DOI
10.1109/TKDE.2025.3536008
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Logical reasoning consistently plays a fundamental and significant role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively address logical reasoning tasks, which require gradual cognitive inference similar to human intelligence, remains an open question. To bridge this gap, we provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, we include three representative early-era LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-grained evaluations in both objective and subjective manners, covering both answers and explanations and comprising answer correctness, explanation correctness, explanation completeness, and explanation redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future work.
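The fine-grained evaluation described in the abstract can be sketched as per-case judgments aggregated into dataset-level rates. The schema and function names below are hypothetical illustrations under that reading, not the paper's actual implementation or data:

```python
from dataclasses import dataclass

@dataclass
class CaseJudgment:
    # Hypothetical per-example judgments for one LLM response.
    answer_correct: bool     # objective: final answer matches the gold label
    explain_correct: bool    # subjective: each reasoning step is valid
    explain_complete: bool   # subjective: no necessary step is missing
    explain_redundant: bool  # subjective: explanation contains irrelevant content

def aggregate(judgments):
    """Aggregate per-case judgments into dataset-level rates."""
    n = len(judgments)
    return {
        "answer_correctness": sum(j.answer_correct for j in judgments) / n,
        "explain_correctness": sum(j.explain_correct for j in judgments) / n,
        "explain_completeness": sum(j.explain_complete for j in judgments) / n,
        "explain_redundancy": sum(j.explain_redundant for j in judgments) / n,
    }

# Illustrative input: three judged cases (fabricated for the sketch).
cases = [
    CaseJudgment(True, True, True, False),
    CaseJudgment(True, False, True, True),
    CaseJudgment(False, False, False, True),
]
scores = aggregate(cases)
```

In this framing, accuracy alone would report only the first rate; the remaining three expose responses whose answer is right but whose explanation is flawed, incomplete, or padded.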
Pages: 1620-1634 (15 pages)