This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

Cited by: 0
Authors
Garcia-Ferrero, Iker [1 ]
Altuna, Begona [1 ]
Alvez, Javier [2 ]
Gonzalez-Dios, Itziar [1 ]
Rigau, German [1 ]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Ctr Ixa, Leioa, Spain
[2] Univ Basque Country UPV EHU, LoRea Grp, Leioa, Spain
Keywords
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Classification Codes: 081104; 0812; 0835; 1405
Abstract
Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs in understanding negation. We introduce a large, semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false, in which negation appears in different forms in about two-thirds of the corpus. We have used our dataset with the largest available open LLMs in a zero-shot approach to grasp their generalization and inference capability, and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation persists, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available: https://github.com/hitz-zentroa/This-is-not-a-Dataset
Pages: 8596-8615
Page count: 20
Related Papers
50 items total
  • [1] Towards a benchmark dataset for large language models in the context of process automation
    Tizaoui, Tejennour
    Tan, Ruomu
    DIGITAL CHEMICAL ENGINEERING, 2024, 13
  • [2] DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark
    Li, Haodong
    Zhang, Xiaofeng
    Qu, Haicheng
    REMOTE SENSING, 2025, 17 (04)
  • [3] A bilingual benchmark for evaluating large language models
    Alkaoud, Mohamed
    PEERJ COMPUTER SCIENCE, 2024, 10
  • [4] CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks for Chinese Large Language Models
    Shi, Dan
    You, Chaobin
    Huang, Jiantao
    Li, Taihao
    Xiong, Deyi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 18952 - 18960
  • [5] HUMBI: A Large Multiview Dataset of Human Body Expressions and Benchmark Challenge
    Yoon, Jae Shin
    Yu, Zhixuan
    Park, Jaesik
    Park, Hyun Soo
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (01) : 623 - 640
  • [6] Causal Dataset Discovery with Large Language Models
    Liu, Junfei
    Sun, Shaotong
    Nargesian, Fatemeh
    WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024, 2024,
  • [7] Construction of a Japanese Financial Benchmark for Large Language Models
    Preferred Networks, Inc., Tokyo, Japan
    JOINT WORKSHOP ON FINANCIAL TECHNOLOGY AND NATURAL LANGUAGE PROCESSING, KNOWLEDGE DISCOVERY FROM UNSTRUCTURED DATA IN FINANCIAL SERVICES, AND ECONOMICS AND NATURAL LANGUAGE PROCESSING, FINNLP-KDF-ECONLP AT LREC-COLING, 2024, : 1 - 9
  • [8] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
    Li, Junyi
    Cheng, Xiaoxue
    Zhao, Wayne Xin
    Nie, Jian-Yun
    Wen, Ji-Rong
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6449 - 6464
  • [9] Understanding the Dataset Practitioners Behind Large Language Models
    Qian, Crystal
    Reif, Emily
    Kahng, Minsuk
    EXTENDED ABSTRACTS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024, 2024,
  • [10] A Chinese Dataset for Evaluating the Safeguards in Large Language Models
    Wang, Yuxia
    Zhai, Zenan
    Li, Haonan
    Han, Xudong
    Lin, Lizhi
    Zhang, Zhenxuan
    Zhao, Jingru
    Nakov, Preslav
    Baldwin, Timothy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 3106 - 3119