Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation

Authors
Allahim, Azzah [1, 2]
Cherif, Asma [1, 3]
Affiliations
[1] King Abdulaziz Univ, Fac Comp & Informat Technol, IT Dept, Jeddah 21589, Saudi Arabia
[2] Jouf Univ, Coll Comp & Informat Sci, Sakaka 72388, Saudi Arabia
[3] King Abdulaziz Univ, Ctr Excellence Smart Environm Res, Jeddah 21589, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 23
Keywords
word embedding; Word2vec; FastText; Arabic embedding; Arabic corpus
DOI
10.3390/app142311104
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
The expanding Arabic user base presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the lack of reliable Arabic word embedding models and the limited availability of Arabic corpora pose significant challenges. This paper addresses these gaps by developing and evaluating Arabic word embedding models trained on diverse Arabic corpora and by investigating how varying hyperparameter values affect model performance across different NLP tasks. To train our models, we collected data from three distinct sources: Wikipedia, newspapers, and 32 Arabic books, each selected to capture specific linguistic and contextual features of Arabic. Using Word2Vec and FastText, we experimented with different hyperparameter configurations, including vector size, window size, and training algorithm (CBOW or skip-gram), to analyze their impact on model quality. Our models were evaluated on a range of NLP tasks, including sentiment analysis, similarity tests, and an adapted analogy test designed specifically for Arabic. The findings revealed that both corpus size and hyperparameter settings had notable effects on performance. For instance, in the analogy test, a larger vocabulary size significantly improved outcomes, with the FastText skip-gram models excelling at solving analogy questions. For sentiment analysis, vocabulary size was critical, while in similarity scoring, the FastText models achieved the highest scores, particularly with smaller window and vector sizes. Overall, our models demonstrated strong performance, achieving 99% and 90% accuracies in sentiment analysis and the analogy test, respectively, along with a similarity score of 8 out of 10. These results underscore the value of our models as a robust tool for Arabic NLP research, addressing a pressing need for high-quality Arabic word embeddings.
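The abstract does not say how the analogy test is scored. As a rough illustration only, the sketch below assumes the common vector-offset method (often called 3CosAdd): for "a is to b as c is to d", the predicted d is the vocabulary word whose vector is closest in cosine similarity to b - a + c. The function names and the toy 2-D vectors are hypothetical and are not taken from the paper; English placeholder words stand in for Arabic vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Assumption: vector-offset scoring, with the three question words
    excluded from the candidate set.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered with the expected d."""
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors invented for illustration; real embeddings would come
# from a trained Word2Vec or FastText model.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 0.5],
}

toy_questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

accuracy = analogy_accuracy(toy_vectors, toy_questions)
```

With trained embeddings, `toy_vectors` would be replaced by the model's word-to-vector mapping, and accuracy would be compared across models trained with different vector sizes, window sizes, and training algorithms.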
Pages: 22