A Large Scale Corpus of Gulf Arabic

Authors
Khalifa, Salam [1 ]
Habash, Nizar [1 ]
Abdulrahim, Dana [2 ]
Hassan, Sara [1 ]
Affiliations
[1] New York University Abu Dhabi, Computational Approaches to Modeling Language Lab, Abu Dhabi, United Arab Emirates
[2] University of Bahrain, Zallaq, Bahrain
Keywords
Arabic Dialects; Corpus; Large-Scale; Gulf Arabic
DOI
Not available
Chinese Library Classification number
H [Language, Writing]
Discipline classification code
05
Abstract
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.
Pages: 4282-4289
Page count: 8