Building a benchmark dataset for the Kurdish news question answering

Authors
Saeed, Ari M. [1]
Affiliations
[1] Univ Halabja, Coll Sci, Comp Sci Dept, Halabja, Kurdistan Region, Iraq
Source
DATA IN BRIEF | 2024 / Vol. 57
Keywords
Kurdish question answering system; Kurdish news dataset; Data mining; Text pre-processing; Machine learning;
DOI
10.1016/j.dib.2024.110916
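Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes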
07 ; 0710 ; 09 ;
Abstract
This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts are collected from various Kurdish news websites using the ParseHub software, which extracts data from different news categories such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs; for each paragraph, one or more question-answer pairs are created manually from its content. The dataset is pre-processed by cleaning and normalizing the data: special characters and stop words are removed during cleaning, and stemming is applied as the normalization step. The distribution of question types in the KNQAD is presented, and the complexity of the QA problem is analyzed using lexical similarity techniques between questions and answers. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
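The pre-processing and lexical-similarity steps named in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the author's method: the stop-word set and suffix list are English placeholders (the abstract does not list the Kurdish resources used), the stemmer is a toy suffix stripper, and Jaccard overlap stands in for whichever lexical similarity measures the article applies.

```python
import re

def clean(text, stop_words):
    """Lowercase, replace non-word characters with spaces, drop stop words."""
    # \w is Unicode-aware in Python 3, so letters in Arabic script are kept.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in stop_words]

def strip_suffix(token, suffixes):
    """Toy stemmer: remove the first matching suffix, keeping >= 3 characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, stop_words, suffixes):
    """Cleaning followed by stemming, in the order the abstract describes."""
    return [strip_suffix(tok, suffixes) for tok in clean(text, stop_words)]

def jaccard(tokens_a, tokens_b):
    """Jaccard overlap between two token lists (assumed similarity measure)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Placeholder resources (hypothetical): a real pipeline would use a Kurdish
# stop-word list and a Kurdish stemmer instead.
STOP_WORDS = {"the", "is", "in", "of", "a", "who", "what"}
SUFFIXES = ("ing", "es", "ed", "s")  # longer suffixes are tried first

question = "Who won the matches in 2024?"
answer = "The team won the match."

q_tokens = preprocess(question, STOP_WORDS, SUFFIXES)
a_tokens = preprocess(answer, STOP_WORDS, SUFFIXES)
score = jaccard(q_tokens, a_tokens)

print(q_tokens, a_tokens, score)
```

A low overlap score between a question and its answer suggests that the answer cannot be found by simple word matching, which is one way to read the complexity analysis the abstract mentions.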
Pages: 12