A Large Scale Corpus of Gulf Arabic

Authors
Khalifa, Salam [1 ]
Habash, Nizar [1 ]
Abdulrahim, Dana [2 ]
Hassan, Sara [1 ]
Affiliations
[1] New York University Abu Dhabi, Computational Approaches to Modeling Language Lab, Abu Dhabi, United Arab Emirates
[2] University of Bahrain, Zallaq, Bahrain
Source
LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016
Keywords
Arabic Dialects; Corpus; Large-Scale; Gulf Arabic;
DOI
Not available
Chinese Library Classification number
H [Language, Writing];
Discipline classification code
05 ;
Abstract
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.
Pages: 4282 - 4289
Page count: 8