Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited: 0
Authors
Xu, Fangzhi [1 ]
Lin, Qika [1 ]
Han, Jiawei [1 ]
Zhao, Tianzhe [1 ]
Liu, Jun [2 ]
Cambria, Erik [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China;
关键词
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;
DOI
10.1109/TKDE.2025.3536008
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Logical reasoning has long played a fundamental role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively handle logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains an open question. To bridge this gap, we provide comprehensive evaluations in this paper. First, for systematic coverage, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, we include three representative early-era LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-level evaluations in both objective and subjective manners, covering answers as well as explanations and measuring answer correctness, explain correctness, explain completeness, and explain redundancy. Additionally, to uncover the logical flaws of LLMs, we attribute problematic cases to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and to benchmark the logical reasoning capability of LLMs in isolation, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future work.
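The abstract's four fine-level metrics (answer correctness, explain correctness, explain completeness, explain redundancy) can be pictured as per-case binary grades aggregated over a corpus. The sketch below is a minimal illustration under that assumption; the field and function names are hypothetical and not taken from the paper's released code.

```python
from dataclasses import dataclass


@dataclass
class GradedCase:
    """One model response graded along the four fine-level dimensions.

    All field names are illustrative, not from the paper's implementation.
    """
    answer_correct: bool    # objective: final answer matches the gold label
    explain_correct: bool   # subjective: every stated reasoning step is valid
    explain_complete: bool  # subjective: no required reasoning step is missing
    explain_redundant: bool  # subjective: explanation contains irrelevant steps


def aggregate(cases: list[GradedCase]) -> dict[str, float]:
    """Aggregate per-case grades into corpus-level ratios in [0, 1]."""
    n = len(cases)
    return {
        "answer_correctness": sum(c.answer_correct for c in cases) / n,
        "explain_correctness": sum(c.explain_correct for c in cases) / n,
        "explain_completeness": sum(c.explain_complete for c in cases) / n,
        "explain_redundancy": sum(c.explain_redundant for c in cases) / n,
    }


# Toy example: four graded responses from one model.
cases = [
    GradedCase(True, True, True, False),
    GradedCase(True, False, False, True),
    GradedCase(False, False, True, False),
    GradedCase(True, True, True, False),
]
print(aggregate(cases))
```

Note that the first three metrics are "higher is better" while redundancy is "lower is better", which is why the paper treats them as separate dimensions rather than folding them into a single accuracy score.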
Pages: 1620 - 1634
Number of Pages: 15
Related Papers
50 records
  • [1] Large Language Models Are Neurosymbolic Reasoners
    Fang, Meng
    Deng, Shilong
    Zhang, Yudi
    Shi, Zijing
    Chen, Ling
    Pechenizkiy, Mykola
    Wang, Jun
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17985 - 17993
  • [2] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
    Chen, Liang
    Deng, Yang
    Bian, Yatao
    Qin, Zeyu
    Wu, Bingzhe
    Chua, Tat-Seng
    Wong, Kam-Fai
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6325 - 6341
  • [3] Large Language Models Are Not Strong Abstract Reasoners
    Gendron, Gael
    Bao, Qiming
    Witbrock, Michael
    Dobbie, Gillian
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 6270 - 6278
  • [4] Large Language Models are Zero-Shot Reasoners
    Kojima, Takeshi
    Gu, Shixiang Shane
    Reid, Machel
    Matsuo, Yutaka
    Iwasawa, Yusuke
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [5] ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
    Lai, Viet Dac
    Nguyen, Nghia Trung
    Ben Veyseh, Amir Pouran
    Man, Hieu
    Dernoncourt, Franck
Bui, Trung
    Nguyen, Thien Huu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13171 - 13189
  • [6] Large Language Models are few(1)-shot Table Reasoners
    Chen, Wenhu
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 1120 - 1130
  • [7] Large Language Models are Better Reasoners with Self-Verification
    Weng, Yixuan
    Zhu, Minjun
    Xia, Fei
    Li, Bin
    He, Shizhu
    Liu, Shengping
    Sun, Bin
    Liu, Kang
    Zhao, Jun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 2550 - 2575
  • [8] A Comprehensive Evaluation of Quantization Strategies for Large Language Models
    Jin, Renren
    Du, Jiangcun
    Huang, Wuwei
    Liu, Wei
    Lu, Jian
    Wang, Bin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 12186 - 12215
  • [9] Medical foundation large language models for comprehensive text analysis and beyond
    Xie, Qianqian
    Chen, Qingyu
    Chen, Aokun
    Peng, Cheng
    Hu, Yan
    Lin, Fongci
    Peng, Xueqing
    Huang, Jimin
    Zhang, Jeffrey
    Keloth, Vipina
    Zhou, Xinyu
    Qian, Lingfei
    He, Huan
    Shung, Dennis
    Ohno-Machado, Lucila
    Wu, Yonghui
    Xu, Hua
    Bian, Jiang
    NPJ DIGITAL MEDICINE, 2025, 8 (01):
  • [10] Large Language Models are Temporal and Causal Reasoners for Video Question Answering
    Ko, Dohwan
    Lee, Ji Soo
    Kang, Wooyoung
    Roh, Byungseok
    Kim, Hyunwoo J.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4300 - 4316