This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

Cited by: 0
Authors
Garcia-Ferrero, Iker [1 ]
Altuna, Begona [1 ]
Alvez, Javier [2 ]
Gonzalez-Dios, Itziar [1 ]
Rigau, German [1 ]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Ctr Ixa, Leioa, Spain
[2] Univ Basque Country UPV EHU, LoRea Grp, Leioa, Spain
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for LLMs' sub-optimal performance in understanding negation. We introduce a large, semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false, in which negation is present in various forms in about two-thirds of the corpus. We have used our dataset with the largest available open LLMs in a zero-shot approach to gauge their generalization and inference capability, and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation persists, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available: https://github.com/hitz-zentroa/This-is-not-a-Dataset
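As a rough illustration of the evaluation protocol the abstract describes (not the authors' actual code, which lives in the linked repository), the zero-shot setup can be sketched as labeling each sentence true or false and reporting accuracy separately on affirmative and negated items. The `score_sentence` stub below is a hypothetical placeholder for a real LLM call; here it deliberately implements the kind of superficial negation cue the paper warns models rely on.

```python
# Sketch of a zero-shot true/false evaluation split by negation type.
# NOTE: score_sentence is a hypothetical stub standing in for an LLM query
# (e.g. comparing the likelihoods of "True" vs "False" continuations).
from collections import defaultdict

def score_sentence(sentence: str) -> bool:
    """Placeholder classifier: predicts False whenever negation appears,
    mimicking the surface-cue heuristic the paper cautions against."""
    return " not " not in f" {sentence} "

def evaluate(dataset):
    """dataset: iterable of (sentence, gold_label, is_negated) triples.
    Returns per-group accuracy for affirmative vs. negated sentences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sentence, label, is_negated in dataset:
        group = "negated" if is_negated else "affirmative"
        total[group] += 1
        if score_sentence(sentence) == label:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

sample = [
    ("A dog is an animal.", True, False),
    ("A dog is not an animal.", False, True),
    ("A hammer is used for driving nails.", True, False),
    ("A hammer is not a plant.", True, True),  # a *true* negated sentence
]
print(evaluate(sample))  # → {'affirmative': 1.0, 'negated': 0.5}
```

The toy result mirrors the paper's headline finding: a model leaning on surface cues scores perfectly on affirmative sentences yet fails on true negated ones.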
Pages: 8596-8615
Page count: 20