Curras: an annotated corpus for the Palestinian Arabic dialect

Authors
Mustafa Jarrar
Nizar Habash
Faeq Alrimawi
Diyam Akra
Nasser Zalmout
Affiliations
[1] Birzeit University
[2] New York University Abu Dhabi
Keywords
Palestinian Arabic; Palestinian corpus; Arabic morphology; Conventional Orthography for Dialectal Arabic; Dialectal Arabic; Word annotation
DOI
Not available
Abstract
In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We begin with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text so as to cover a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts in annotation guideline development and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation metadata are described in detail. The Curras Palestinian Arabic corpus consists of more than 56K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.
Pages: 745-775
Page count: 30