Building a benchmark dataset for the Kurdish news question answering

Authors
Saeed, Ari M. [1]
Affiliations
[1] Univ Halabja, Coll Sci, Comp Sci Dept, Halabja, Kurdistan Region, Iraq
Source
DATA IN BRIEF | 2024 / Volume 57
Keywords
Kurdish question answering system; Kurdish news dataset; Data mining; Text pre-processing; Machine learning;
DOI
10.1016/j.dib.2024.110916
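Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code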
07 ; 0710 ; 09 ;
Abstract
This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts are collected from various Kurdish news websites. The ParsHub software is used to extract data from different fields of news, such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs. For each news paragraph, one or more question-answer pairs are manually created based on the content of the paragraphs. The dataset is pre-processed by cleaning and normalizing the data. During the cleaning process, special characters and stop words are removed, and stemming is used as a normalization step. The distribution of each question type is presented in the KNQAD. Moreover, the complexity of the QA problem is analyzed in the KNQAD by using lexical similarity techniques between questions and answers. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
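The cleaning step and the lexical similarity analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the stop-word list is a hypothetical English placeholder, stemming is omitted, and Jaccard token overlap is assumed here as one common lexical similarity measure (the abstract does not say which techniques were used).

```python
import re

# Hypothetical stop-word list for illustration only; KNQAD would need a Kurdish list.
STOP_WORDS = {"the", "a", "of", "in", "is", "what"}


def clean(text: str) -> list[str]:
    """Lowercase the text, replace special characters with spaces, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def jaccard(question: str, answer: str) -> float:
    """Jaccard overlap between the cleaned token sets of a question and an answer."""
    q, a = set(clean(question)), set(clean(answer))
    if not q and not a:
        return 0.0  # nothing left to compare after cleaning
    return len(q & a) / len(q | a)


if __name__ == "__main__":
    print(jaccard("What is the capital of Iraq?", "The capital is Baghdad."))
```

A low score between a question and its answer suggests that the pair cannot be matched by surface word overlap alone, which is one way to read the "complexity" the abstract refers to.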
Pages: 12