Effect of Stemming on Text Similarity for Arabic Language at Sentence Level

Cited by: 0
Authors
Alhawarat M.O. [1]
Abdeljaber H. [1]
Hilal A. [2]
Affiliations
[1] Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj
[2] General Department, College of Preparatory Year, Prince Sattam Bin Abdulaziz University, Alkharj
Keywords
Lemmatization; Machine learning; Natural language processing; Semantic text similarity; Stemming; TF-IDF; Word embedding
DOI
10.7717/PEERJ-CS.530
Abstract
Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Ten algorithms are used in total: several Arabic light and heavy stemmers as well as lemmatization algorithms. The standard training and test data sets come from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity under different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in the experiments achieves higher Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers, the Khoja heavy stemmer and the AlKhalil light stemmer, give worse results than the original text. Copyright 2021 Alhawarat et al.
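The idea summarized in the abstract, that normalizing word forms can raise the lexical overlap between two sentences and that system scores are then judged by Pearson correlation against gold scores, can be illustrated with a minimal sketch. This is not the authors' pipeline: the affix lists and the `light_stem` function are hypothetical stand-ins for a real light stemmer such as ARLSTem, and plain term-count cosine similarity is swapped in for the TF-IDF and word-embedding features and trained models mentioned above.

```python
import math
from collections import Counter

# Hypothetical affix lists, for illustration only. Real light stemmers
# (e.g. ARLSTem, Farasa) apply far richer rules than this.
PREFIXES = ("وال", "بال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ة")


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (raw term counts)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Two made-up sentences that differ only in the definite article.
sent_a = "الطالب يقرأ الكتاب".split()
sent_b = "طالب يقرأ كتاب".split()

raw_score = cosine(sent_a, sent_b)
stemmed_score = cosine([light_stem(w) for w in sent_a],
                       [light_stem(w) for w in sent_b])
print(raw_score, stemmed_score)
```

In this toy pair the raw sentences share only one token, while the stemmed versions share all three, so the stemmed cosine score is higher. On a real data set, the per-pair scores would be passed to `pearson` together with the gold annotations to obtain the kind of correlation figure the abstract reports.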
Pages: 1 / 18
Page count: 17
Related papers
50 in total
  • [21] The Order of Sentence Elements at Complex Sentence Level and Text Level
    Badurina, Lada
    RASPRAVE, 2013, 39 (02): : 299 - 310
  • [22] Contextual Text Categorization: An Improved Stemming Algorithm to Increase the Quality of Categorization in Arabic Text
    Gadri, Said
    Moussaoui, Abdelouahab
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2017, 14 (06) : 835 - 841
  • [23] Low-level natural language technique for arabic text processing
    Awajan, A
    COMPUTERS AND THEIR APPLICATIONS, 2001, : 387 - 390
  • [24] A Chinese Short Text Similarity Method Integrating Sentence-Level and Phrase-Level Semantics
    Shen, Zhenji
    Xiao, Zhiyong
    ELECTRONICS, 2024, 13 (24):
  • [25] Word-Level vs Sentence-Level Language Identification: Application to Algerian and Arabic Dialects
    Lichouri, Mohamed
    Abbas, Mourad
    Freihat, Abed Alhakim
    Megtouf, Dhiya El Hak
    ARABIC COMPUTATIONAL LINGUISTICS, 2018, 142 : 246 - 253
  • [26] Language shift and sentence processing in Moroccan Arabic
    ElAissati, A
    LANGUAGE CHOICES: CONDITIONS, CONSTRAINTS, AND CONSEQUENCES, 1997, 1 : 77 - 90
  • [27] Addressing Stemming Algorithm for Arabic Text Using Spark Over Hadoop
    Bougar, Marieme
    Ziyati, El Houssaine
    ADVANCED INTELLIGENT SYSTEMS FOR SUSTAINABLE DEVELOPMENT (AI2SD'2019): VOL 1 - ADVANCED INTELLIGENT SYSTEMS FOR EDUCATION AND INTELLIGENT LEARNING SYSTEM, 2020, 1102 : 74 - 82
  • [28] Effect of Stemming on Hindi Text Classification
    Pimpalshende, Anjusha
    Singh, Preety
    Potnurwar, Archana
    INTERNATIONAL JOURNAL OF NEXT-GENERATION COMPUTING, 2023, 14 (01): : 208 - 215
  • [29] Sentence Similarity Detection in Malayalam Language using cosine similarity
    Gokul, P. P.
    Akhil, B. K.
    Kumar, Shiva K. M.
    2017 2ND IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT), 2017, : 221 - 225
  • [30] The Effect of Stemming and Stop-Word-Removal on Automatic Text Classification in Turkish Language
    Cagatayli, Mustafa
    Celebi, Erbug
    NEURAL INFORMATION PROCESSING, PT I, 2015, 9489 : 168 - 176