Efficient discrimination between Arabic dialects

Authors
Bessou S. [1]
Sari R. [1]
Affiliation
[1] Department of Computer Science, Faculty of Sciences, University Ferhat Abbas Sétif 1, Sétif
Keywords
Arabic; Computational linguistics; Dialects identification; Logistic regression; Machine learning; Social media
DOI
10.2174/2213275912666190716115604
Abstract
Background: With the explosion of communication technologies and the accompanying pervasive use of social media, posts, reviews, comments, and other forms of expression have proliferated in many languages. This content has attracted researchers from different fields: economics, political science, social sciences, psychology, and particularly language processing. One prominent subject is discriminating between similar languages and dialects using natural language processing and machine learning techniques. The problem is usually addressed by formulating identification as a classification task.

Methods: The approach uses machine learning classification methods to discriminate between Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf, and North African. Several models were trained to discriminate between the studied dialects on large, manually annotated corpora mined from online Arabic newspapers.

Results: Experimental results showed that n-gram features could substantially improve performance. Logistic regression on character and word n-gram features represented as count vectors identified the handled dialects with an overall accuracy of 95%. The best results were achieved with a linear support vector classifier using TF-IDF vectors built from character unigrams, bigrams, and trigrams and word unigrams and bigrams, with an overall accuracy of 95.1%.

Conclusion: N-gram features could substantially improve performance. Additionally, the choice of data representation could provide a significant performance boost over a simple representation. © 2020 Bentham Science Publishers.
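The feature representation named in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the "c:"/"w:" feature prefixes, the smoothed IDF formula, and the L2 normalization are all assumptions chosen for illustration, and the input strings are placeholder text rather than corpus data. The sketch builds character 1- to 3-gram and word 1- to 2-gram count vectors and then reweights them with TF-IDF.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """Return every whitespace-token n-gram of length n_min..n_max."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def count_vector(text):
    """Sparse count vector over character and word n-grams.

    The "c:" and "w:" prefixes (an assumption of this sketch) keep the two
    feature families from colliding.
    """
    feats = Counter()
    feats.update("c:" + g for g in char_ngrams(text))
    feats.update("w:" + g for g in word_ngrams(text))
    return feats


def tfidf_vectors(texts):
    """Reweight count vectors with a smoothed IDF and L2-normalize each one."""
    counts = [count_vector(t) for t in texts]
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each feature counted once per document
    idf = {f: math.log((1 + n_docs) / (1 + d)) + 1.0
           for f, d in doc_freq.items()}
    vectors = []
    for c in counts:
        weighted = {f: tf * idf[f] for f, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in weighted.values())) or 1.0
        vectors.append({f: v / norm for f, v in weighted.items()})
    return vectors


# Placeholder strings only; real input would be the annotated sentences.
docs = ["ab ab", "ab cd"]
vectors = tfidf_vectors(docs)
```

Sparse vectors of this kind would then be passed to a linear classifier such as logistic regression or a linear support vector classifier, the two model families the abstract reports.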
Pages: 725-730
Page count: 5