Effect of Stemming on Text Similarity for Arabic Language at Sentence Level

Authors
Alhawarat M.O. [1]
Abdeljaber H. [1]
Hilal A. [2]
Affiliations
[1] Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj
[2] General Department, College of Preparatory Year, Prince Sattam Bin Abdulaziz University, Alkharj
Keywords
Lemmatization; Machine learning; Natural language processing; Semantic text similarity; Stemming; TF-IDF; Word embedding
DOI
10.7717/PEERJ-CS.530
Abstract
Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Ten algorithms are used in total, covering several Arabic light and heavy stemmers as well as lemmatization algorithms. Standard training and test data sets from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar), are used. Different features are selected to study the effect of stemming on text similarity under different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in the experiments achieves better Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers, the Khoja heavy stemmer and the AlKhalil light stemmer, produce worse results than the original text. Copyright 2021 Alhawarat et al.
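The evaluation idea summarized in the abstract (score sentence pairs with a similarity measure, with and without stemming, then compare against gold scores using Pearson correlation) can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: `toy_light_stem` is a hypothetical stand-in that strips a few common affixes and is not one of the ten stemmers studied, the feature is plain TF-IDF cosine similarity, and no machine learning model is trained.

```python
import math
from collections import Counter

# Hypothetical affix lists for a toy light stemmer (illustration only;
# not ARLSTem, Farasa, or any other stemmer named in the abstract).
PREFIXES = ("وال", "بال", "ال", "لل")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def toy_light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("Pearson correlation is undefined for constant input")
    return cov / (sx * sy)


def pair_similarities(pairs, normalize=lambda t: t):
    """TF-IDF cosine similarity per sentence pair, after token normalization."""
    docs = [[normalize(t) for t in s.split()] for pair in pairs for s in pair]
    vecs = tfidf_vectors(docs)
    return [cosine(vecs[2 * i], vecs[2 * i + 1]) for i in range(len(pairs))]
```

Under these assumptions, calling `pair_similarities(pairs)` and `pair_similarities(pairs, toy_light_stem)` on a list of sentence pairs, then passing each result to `pearson` together with the gold similarity scores, gives two correlations whose difference indicates the effect of the normalization step.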
Pages: 1 / 18
Number of pages: 17
Related papers
50 in total
  • [1] Effect of stemming on text similarity for Arabic language at sentence level
    Alhawarat, Mohammad O.
    Abdeljaber, Hikmat
    Hilal, Anwer
    PEERJ COMPUTER SCIENCE, 2021
  • [2] The Effect of using Light Stemming for Arabic Text Classification
    Atwan, Jaffar
    Wedyan, Mohammad
    Bsoul, Qusay
    Hamadeen, Ahmad
    Alturki, Ryan
    Ikram, Mohammed
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2021, 12 (05) : 768 - 773
  • [3] The Effect of Stemming on Arabic Text Classification: An Empirical Study
    Wahbeh, Abdullah
    Al-Kabi, Mohammed
    Al-Radaideh, Qasem
    Al-Shawakfa, Emad
    Alsmadi, Izzat
    INTERNATIONAL JOURNAL OF INFORMATION RETRIEVAL RESEARCH, 2011, 1 (03) : 54 - 70
  • [4] Effect of ISRI Stemming on Similarity Measure for Arabic Document Clustering
    Bsoul, Qusay Walid
    Mohd, Masnizah
    INFORMATION RETRIEVAL TECHNOLOGY, 2011, 7097 : 584 - 593
  • [5] Impact of stemming on Arabic text summarization
    Alami, Nabil
    Meknassi, Mohammed
    Ouatik, Said Alaoui
    Ennahnahi, NourEddine
    2016 4TH IEEE INTERNATIONAL COLLOQUIUM ON INFORMATION SCIENCE AND TECHNOLOGY (CIST), 2016, : 338 - 343
  • [6] Stemming and similarity measures for Arabic Documents Clustering
    L.T.T.I, University Sidi Mohamed Ben Abdellah , Fez, Morocco
    Unknown
    Unknown
    Int. Symp. I/V Commun. Mob. Networks, ISIVC,
  • [7] Simple Stemming Rules for Arabic Language
    Soori, Hussein
    Platos, Jan
    Snasel, Vaclav
    PROCEEDING OF THE THIRD INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN COMPUTER INTERACTION (IHCI 2011), 2013, 179 : 99 - +
  • [8] Arabic Text Stemming: Comparative Analysis.
    Mamoun, Rasha
    Ahmed, Mahmoud
    2016 CONFERENCE OF BASIC SCIENCES AND ENGINEERING STUDIES (SCGAC), 2016, : 88 - 93
  • [9] Stemming versus light stemming as feature selection techniques for Arabic text categorization
    Duwairi, Rehab
    Al-Refai, Mohammad
    Khasawneh, Natheer
    2007 INNOVATIONS IN INFORMATION TECHNOLOGIES, VOLS 1 AND 2, 2007, : 199 - 203
  • [10] Comparative Analysis of Similarity Measures for Sentence Level Semantic Measurement of Text
    Saad, Sazianti Mohd
    Kamarudin, Siti Sakira
    2013 IEEE INTERNATIONAL CONFERENCE ON CONTROL SYSTEM, COMPUTING AND ENGINEERING (ICCSCE 2013), 2013, : 90 - +