Arabic Dialect Identification

Cited by: 105
Authors
Zaidan, Omar F. [1]
Callison-Burch, Chris [2]
Affiliations
[1] Microsoft Res, Seattle, WA USA
[2] Univ Penn, Comp & Informat Sci Dept, Philadelphia, PA 19104 USA
Keywords
LANGUAGE IDENTIFICATION; AGREEMENT
DOI
10.1162/COLI_a_00169
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic, the true native languages of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.
Pages: 171-202
Page count: 32