Curras: an annotated corpus for the Palestinian Arabic dialect

Authors
Mustafa Jarrar
Nizar Habash
Faeq Alrimawi
Diyam Akra
Nasser Zalmout
Affiliations
[1] Birzeit University
[2] New York University Abu Dhabi
Source
Language Resources and Evaluation, 2017, 51(3)
Keywords
Palestinian Arabic; Palestinian corpus; Arabic morphology; Conventional Orthography for Dialectal Arabic; Dialectal Arabic; Word annotation
DOI
Not available
Abstract
In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We begin with background that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variation. We then describe the methodology we developed to collect Palestinian Arabic text across a variety of representative domains and genres. We also discuss the annotation process, which extended previous annotation guideline efforts and used existing automatic annotation tools for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation metadata are described in detail. The Curras corpus consists of more than 56K tokens, annotated with rich morphological and lexical features. Inter-annotator agreement results indicate a high degree of consistency.
Pages: 745-775
Page count: 30
Related papers
50 in total
  • [1] Curras: an annotated corpus for the Palestinian Arabic dialect
    Jarrar, Mustafa
    Habash, Nizar
    Alrimawi, Faeq
    Akra, Diyam
    Zalmout, Nasser
    LANGUAGE RESOURCES AND EVALUATION, 2017, 51 (03) : 745 - 775
  • [3] The MADAR Arabic Dialect Corpus and Lexicon
    Bouamor, Houda
    Habash, Nizar
    Salameh, Mohammad
    Zaghouani, Wajdi
    Rambow, Owen
    Abdulrahim, Dana
    Obeid, Ossama
    Khalifa, Salam
    Eryani, Fadhl
    Erdmann, Alexander
    Oflazer, Kemal
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3387 - 3396
  • [4] BAAC: Bangor Arabic Annotated Corpus
    Alkhazi, Ibrahim S.
    Teahan, William J.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2018, 9 (11) : 131 - 140
  • [5] A Morphologically Annotated Corpus of Emirati Arabic
    Khalifa, Salam
    Habash, Nizar
    Eryani, Fadhl
    Obeid, Ossama
    Abdulrahim, Dana
    Al Kaabi, Meera
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3839 - 3846
  • [6] An Annotated Speech Corpus of Rare Dialect for Recognition-Take Dali Dialect as an Example
    Huang, Tian
    Yang, Dongqi
    Qin, Wanyun
    Zhang, Shubo
    Li, Binyang
    Li, Yan
    COGNITIVE COMPUTING, ICCC 2021, 2022, 12992 : 3 - 13
  • [7] A Morphologically Annotated Corpus and a Morphological Analyzer for Egyptian Arabic
    Fashwan, Amany
    Alansary, Sameh
    AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 203 - 210
  • [9] The design, construction and evaluation of annotated Arabic cyberbullying corpus
    Shannag, Fatima
    Hammo, Bassam H.
    Faris, Hossam
    EDUCATION AND INFORMATION TECHNOLOGIES, 2022, 27 (08) : 10977 - 11023
  • [10] Arabic Speech Emotion Recognition From Saudi Dialect Corpus
    Aljuhani, Reem Hamed
    Alshutayri, Areej
    Alahdal, Shahd
    IEEE ACCESS, 2021, 9 : 127081 - 127085