Building a benchmark dataset for the Kurdish news question answering

Cited by: 0
Author
Saeed, Ari M. [1 ]
Affiliation
[1] Univ Halabja, Coll Sci, Comp Sci Dept, Halabja, Kurdistan Region, Iraq
Source
DATA IN BRIEF | 2024 / Vol. 57
Keywords
Kurdish question answering system; Kurdish news dataset; Data mining; Text pre-processing; Machine learning;
DOI
10.1016/j.dib.2024.110916
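Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code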
07 ; 0710 ; 09 ;
Abstract
This article presents the Kurdish News Question Answering Dataset (KNQAD). The texts are collected from various Kurdish news websites. The ParsHub software is used to extract data from different fields of news, such as social news, religion, sports, science, and economy. The dataset consists of 15,002 news paragraphs with question-answer pairs. For each news paragraph, one or more question-answer pairs are manually created based on the content of the paragraphs. The dataset is pre-processed by cleaning and normalizing the data. During the cleaning process, special characters and stop words are removed, and stemming is used as a normalization step. The distribution of each question type is presented in the KNQAD. Moreover, the complexity of the QA problem is analyzed in the KNQAD by using lexical similarity techniques between questions and answers. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
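The cleaning and lexical-similarity steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measures or stop-word list were used, so everything here is an assumption: Jaccard overlap stands in for "lexical similarity", the stop-word set is a hypothetical English placeholder, the function names `clean` and `jaccard` are invented for illustration, and stemming is omitted entirely.

```python
import re

# Hypothetical placeholder stop words (English, for illustration only);
# the dataset's actual Kurdish stop-word list is not given in the abstract.
STOP_WORDS = {"the", "a", "of", "in", "is"}


def clean(text, stop_words=STOP_WORDS):
    """Lowercase, replace special characters with spaces, drop stop words."""
    # \w is Unicode-aware in Python 3, so letters of non-Latin scripts are kept.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in stop_words]


def jaccard(question, answer):
    """Jaccard overlap between the token sets of a question and an answer."""
    q, a = set(clean(question)), set(clean(answer))
    if not q and not a:
        return 0.0  # Avoid division by zero when both sides are empty.
    return len(q & a) / len(q | a)


score = jaccard("Who won the match in Erbil?", "The home team won the match.")
print(round(score, 3))
```

A low score suggests that the answer shares few surface words with the question, which is one rough way to read "complexity" in a QA pair. Note that this regular expression also strips characters such as the zero-width non-joiner, so a real pipeline for Kurdish text would likely need a script-specific cleaning rule.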
Pages: 12
Related papers
50 in total
  • [21] Building a Question-Answering Corpus Using Social Media and News Articles
    Cavalin, Paulo
    Figueiredo, Flavio
    de Bayser, Maira
    Moyano, Luis
    Candello, Heloisa
    Appel, Ana
    Souza, Renan
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE (PROPOR 2016), 2016, 9727 : 353 - 358
  • [22] Natural Questions: A Benchmark for Question Answering Research
    Kwiatkowski T.
    Palomaki J.
    Redfield O.
    Collins M.
    Parikh A.
    Alberti C.
    Epstein D.
    Polosukhin I.
    Devlin J.
    Lee K.
    Toutanova K.
    Jones L.
    Kelcey M.
    Chang M.-W.
    Dai A.M.
    Uszkoreit J.
    Le Q.
    Petrov S.
    Transactions of the Association for Computational Linguistics, 2019, 7 : 453 - 466
  • [23] Natural Questions: A Benchmark for Question Answering Research
    Kwiatkowski, Tom
    Palomaki, Jennimaria
    Redfield, Olivia
    Collins, Michael
    Parikh, Ankur
    Alberti, Chris
    Epstein, Danielle
    Polosukhin, Illia
    Devlin, Jacob
    Lee, Kenton
    Toutanova, Kristina
    Jones, Llion
    Kelcey, Matthew
    Chang, Ming-Wei
    Dai, Andrew M.
    Uszkoreit, Jakob
    Quoc Le
    Petrov, Slav
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2019, 7 : 453 - 466
  • [24] BEnQA: A Question Answering Benchmark for Bengali and English
    Shafayat, Sheikh
    Hasan, H. M. Quamran
    Mahim, Minhajur Rahman Chowdhury
    Putri, Rifki Afina
    Thorne, James
    Oh, Alice
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1158 - 1177
  • [25] TimelineQA: A Benchmark for Question Answering over Timelines
    Tan, Wang-Chiew
    Dwivedi-Yu, Jane
    Li, Yuliang
    Mathias, Lambert
    Saeidi, Marzieh
    Yan, Jing Nathan
    Halevy, Alon Y.
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 77 - 91
  • [26] KoBBQ: Korean Bias Benchmark for Question Answering
    Jin, Jiho
    Kim, Jiseon
    Lee, Nayeon
    Yoo, Haneul
    Oh, Alice
    Lee, Hwaran
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 507 - 524
  • [27] PubMedQA: A Dataset for Biomedical Research Question Answering
    Jin, Qiao
    Dhingra, Bhuwan
    Liu, Zhengping
    Cohen, William W.
    Lu, Xinghua
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2567 - 2577
  • [28] ArabicaQA: A Comprehensive Dataset for Arabic Question Answering
    Abdallah, Abdelrahman
    Kasem, Mahmoud
    Abdalla, Mahmoud
    Mahmoud, Mohamed
    Elkasaby, Mohamed
    Elbendary, Yasser
    Jatowt, Adam
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2049 - 2059
  • [29] VQuAD: Video Question Answering Diagnostic Dataset
    Gupta, Vivek
    Patro, Badri N.
    Parihar, Hemant
    Namboodiri, Vinay P.
    2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW 2022), 2022, : 282 - 291
  • [30] TutorialVQA: Question Answering Dataset for Tutorial Videos
    Colas, Anthony
    Kim, Seokhwan
    Dernoncourt, Franck
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 5450 - 5455