Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Authors
Xu, Fangzhi [1]
Lin, Qika [1]
Han, Jiawei [1]
Zhao, Tianzhe [1]
Liu, Jun [2]
Cambria, Erik [3]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China
Keywords
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;
DOI
10.1109/TKDE.2025.3536008
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, the question of whether LLMs can effectively address the task of logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains unanswered. To this end, we aim to bridge this gap and provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. Considering the comprehensiveness of evaluations, we include 3 early-era representative LLMs and 4 trending LLMs. Second, different from previous evaluations relying only on simple metrics (e.g., accuracy), we propose fine-level evaluations in objective and subjective manners, covering both answers and explanations, including answer correctness, explain correctness, explain completeness and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases will be attributed to five error types from two dimensions, i.e., evidence selection process and reasoning process. Third, to avoid the influences of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on the in-depth evaluations, this paper finally forms a general evaluation scheme of logical reasoning capability from six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future works.
Pages: 1620-1634
Page count: 15