Curras: an annotated corpus for the Palestinian Arabic dialect

Authors
Mustafa Jarrar
Nizar Habash
Faeq Alrimawi
Diyam Akra
Nasser Zalmout
Affiliations
[1] Birzeit University
[2] New York University Abu Dhabi
Keywords
Palestinian Arabic; Palestinian corpus; Arabic morphology; Conventional Orthography for Dialectal Arabic; Dialectal Arabic; Word annotation
DOI
Not available
Abstract
In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We begin with background that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variation. We then describe the methodology we developed to collect Palestinian Arabic text so that it covers a variety of representative domains and genres. We also discuss the annotation process, which extended previous work on annotation guideline development and used existing automatic annotation tools for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation metadata are described in detail. The Curras Palestinian Arabic corpus consists of more than 56K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.
Pages: 745-775
Page count: 30