NAFIS: A Gold Standard Corpus for Arabic Stemmers Evaluation

Cited by: 0
Authors
Namly, Driss [1]
Tajmout, Rachida [1]
Bouzoubaa, Karim [1]
Abouenour, Lahsen [1]
Affiliations
[1] Mohammed V Univ, Mohammadia Sch Engineers, Rabat, Morocco
Keywords
component; Arabic language; Arabic stemming; Stemmers evaluation; Evaluation corpus; Gold Standard Corpus;
DOI
Not available
Chinese Library Classification
F [Economics];
Discipline classification code
02;
Abstract
Arabic stemming, an important pre-processing task in Arabic natural language processing services and applications, suffers from two serious deficiencies: "unique stemming solution" and "stemmers' performance inconsistency". These defects are mainly caused by the absence of a Gold Standard Corpus. Defined as a collection of texts stored in an electronic format, selected to be representative of a particular language, collection, or genre, and manually annotated and enriched with additional linguistic information, such a corpus is used in stemmer benchmarking work. This paper presents NAFIS (Normalized Arabic Fragments for Inestimable Stemming), a gold standard corpus for Arabic stemming. We describe the methodology used to build NAFIS and use it as an evaluation corpus in a benchmarking exercise.
Pages: 1868 - 1877
Number of pages: 10