The plausibility machine commonsense (PMC) dataset: A massively crowdsourced human-annotated dataset for studying plausibility in large language models

Cited by: 0
Authors
Nananukul, Navapat [1 ]
Shen, Ke [1 ]
Kejriwal, Mayank [1 ]
Affiliations
[1] Univ Southern Calif, Inst Informat Sci, 4676 Admiralty Way,Suite 1001, Marina Del Rey, CA 90292 USA
Source
DATA IN BRIEF | 2024 / Vol. 57
Keywords
Commonsense benchmark; Large language models; Machine annotation; Human annotation;
DOI
10.1016/j.dib.2024.110869
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy, Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
Commonsense reasoning has emerged as a challenging problem in Artificial Intelligence (AI). However, one area of commonsense reasoning that has not received nearly as much attention in the AI research community is plausibility assessment, which focuses on determining the likelihood of commonsense statements. Human-annotated benchmarks are essential for advancing research in this nascent area, as they enable researchers to develop and evaluate AI models effectively. Because plausibility is a subjective concept, it is important to obtain nuanced annotations rather than a binary label of 'plausible' or 'implausible'. It is also important to obtain multiple human annotations per statement to ensure the validity of the labels. In this data article, we describe the process of re-annotating an existing commonsense plausibility dataset (SemEval-2020 Task 4) using large-scale crowdsourcing on the Amazon Mechanical Turk platform. We obtained 10,000 unique annotations on a corpus of 2,000 sentences (five independent annotations per sentence). Based on these annotations, each sentence was labelled as plausible, implausible, or ambiguous. Next, we prompted the GPT-3.5 and GPT-4 models developed by OpenAI: sentences from the human-annotated files were fed into the models using custom prompt templates, and the labels the models generated were compared against the human labels to measure alignment. The PMC-Dataset is meant to serve as a rich resource for analysing and comparing human and machine commonsense reasoning capabilities, specifically on plausibility. Researchers can utilise this dataset to train, fine-tune, and evaluate AI models on plausibility. Applications include determining the likelihood of everyday events, assessing the realism of hypothetical scenarios, and distinguishing between plausible and implausible statements in commonsense text. Ultimately, we intend for the dataset to support ongoing AI research by offering a robust foundation for developing models that are better aligned with human commonsense reasoning. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
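To make the described pipeline concrete, the minimal sketch below shows one plausible way to implement its two stages: majority-vote aggregation of the five crowd annotations into a plausible/implausible/ambiguous label, and an OpenAI chat-completion call that elicits a model label for the alignment check. The record field names, the majority threshold, and the prompt wording are illustrative assumptions; the paper's custom prompt templates and exact aggregation rule are not reproduced here.

```python
# Hypothetical sketch of the abstract's pipeline: aggregate five crowd
# annotations per sentence, then compare against a GPT-generated label.
# Field names, threshold, and prompt are assumptions, not the published method.
from collections import Counter

from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

LABELS = ("plausible", "implausible")

def aggregate(annotations, min_majority=4):
    """Collapse five crowd labels into plausible/implausible/ambiguous.

    A sentence is 'ambiguous' when no label reaches the (assumed)
    majority threshold; the paper's exact aggregation rule may differ.
    """
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_majority else "ambiguous"

def model_label(client, sentence, model="gpt-4"):
    """Ask a chat model for a one-word plausibility judgment
    (illustrative prompt, not the paper's custom template)."""
    prompt = (
        "Is the following statement plausible or implausible? "
        f"Answer with one word.\n\nStatement: {sentence}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep model labels as stable as possible across runs
    )
    answer = resp.choices[0].message.content.strip().lower().rstrip(".")
    return answer if answer in LABELS else "ambiguous"

def alignment_rate(client, records):
    """Fraction of sentences where the model label matches the aggregated
    human label. `records` is a hypothetical list of dicts:
    [{"sentence": str, "annotations": [str] * 5}, ...]."""
    hits = sum(
        model_label(client, r["sentence"]) == aggregate(r["annotations"])
        for r in records
    )
    return hits / len(records)

# Usage with hypothetical data:
# client = OpenAI()
# records = [{"sentence": "He put an elephant into the fridge.",
#             "annotations": ["implausible"] * 4 + ["ambiguous"]}]
# print(alignment_rate(client, records))
```

Setting temperature to 0 in the completion call makes the model's labels as deterministic as the API allows, which matters when comparing against a fixed set of human annotations.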
Pages: 6
Related Papers
7 records in total
  • [1] Human-annotated dataset for social media sentiment analysis for Albanian language
    Kadriu, Fatbardh
    Murtezaj, Doruntina
    Gashi, Fatbardh
    Ahmedi, Lule
    Kurti, Arianit
    Kastrati, Zenun
    DATA IN BRIEF, 2022, 43
  • [2] DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
    Pfitzmann, Birgit
    Auer, Christoph
    Dolfi, Michele
    Nassar, Ahmed S.
    Staar, Peter
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 3743 - 3751
  • [3] New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models
    Anetta, Kristof
    Horak, Ales
    TEXT, SPEECH, AND DIALOGUE, TSD 2024, PT I, 2024, 15048 : 110 - 120
  • [4] Annotated dataset creation through large language models for non-english medical NLP
    Frei, Johann
    Kramer, Frank
    JOURNAL OF BIOMEDICAL INFORMATICS, 2023, 145
  • [5] CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks for Chinese Large Language Models
    Shi, Dan
    You, Chaobin
    Huang, Jiantao
    Li, Taihao
    Xiong, Deyi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 18952 - 18960
  • [6] TGEA 2.0: A Large-Scale Diagnostically Annotated Dataset with Benchmark Tasks for Text Generation of Pretrained Language Models
    Ge, Huibin
    Zhao, Xiaohu
    Liu, Chuang
    Zeng, Yulong
    Liu, Qun
    Xiong, Deyi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [7] Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset
    Wang, Aria Y.
    Kay, Kendrick
    Naselaris, Thomas
    Tarr, Michael J.
    Wehbe, Leila
    NATURE MACHINE INTELLIGENCE, 2023, 5 (12) : 1415 - 1426