A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 1
Authors
Qarah, Faisal [1]
Alsanoosy, Tawfeeq [1]
Affiliations
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications
DOI
10.3390/app14135696
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text using pretraining on large-scale corpora. Tokenization plays a significant role in the process of lexical analysis. Tokens become the input for other natural language processing (NLP) tasks, like semantic parsing and language modeling. However, there is a lack of research on the evaluation of the impact of tokenization on the Arabic language model. Therefore, this study aims to address this gap in the literature by evaluating the performance of various tokenizers on Arabic large language models (LLMs). In this paper, we analyze the differences between WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models using each tokenizer while measuring the performance of each model on seven different NLP tasks using 29 different datasets. Overall, the model pretrained with text tokenized using the SentencePiece tokenizer significantly outperforms the other two models that utilize WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, making better decisions in selecting the best tokenizers, improving feature engineering, and making models more efficient, thus ultimately leading to advancements in various NLP applications.
Pages: 17
Related papers
50 in total
  • [21] A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
    Wu, Tianhe
    Ma, Kede
    Liang, Jie
    Yang, Yujiu
    Zhang, Lei
    COMPUTER VISION - ECCV 2024, PT LXXIV, 2025, 15132 : 143 - 160
  • [22] A Comprehensive Evaluation of Large Language Models for Turkish Abstractive Dialogue Summarization
    Buyuk, Osman
    IEEE ACCESS, 2024, 12 : 124391 - 124401
  • [23] TIMEBENCH: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
    Chu, Zheng
    Chen, Jingchang
    Chen, Qianglong
    Yu, Weijiang
    Wang, Haotian
    Liu, Ming
    Qin, Bing
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 1204 - 1228
  • [24] A comprehensive review of large language models: issues and solutions in learning environments
    Shahzad, Tariq
    Mazhar, Tehseen
    Tariq, Muhammad Usman
    Ahmad, Wasim
    Ouahada, Khmaies
    Hamam, Habib
    DISCOVER SUSTAINABILITY, 2025, 6 (01):
  • [25] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
    Chen, Liang
    Deng, Yang
    Bian, Yatao
    Qin, Zeyu
    Wu, Bingzhe
    Chua, Tat-Seng
    Wong, Kam-Fai
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6325 - 6341
  • [26] Comprehensive testing of large language models for extraction of structured data in pathology
    Bastian Grothey
    Jan Odenkirchen
    Adnan Brkic
    Birgid Schömig-Markiefka
    Alexander Quaas
    Reinhard Büttner
    Yuri Tolkach
    Communications Medicine, 5 (1):
  • [27] Unlocking the Black Box? A Comprehensive Exploration of Large Language Models in Rehabilitation
    Bonnechere, Bruno
    AMERICAN JOURNAL OF PHYSICAL MEDICINE & REHABILITATION, 2024, 103 (06) : 532 - 537
  • [28] Hate Speech Detection Using Large Language Models: A Comprehensive Review
    Albladi, Aish
    Islam, Minarul
    Das, Amit
    Bigonah, Maryam
    Zhang, Zheng
    Jamshidi, Fatemeh
    Rahgouy, Mostafa
    Raychawdhary, Nilanjana
    Marghitu, Daniela
    Seals, Cheryl
    IEEE ACCESS, 2025, 13 : 20871 - 20892
  • [29] A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
    Xu, Zihao
    Liu, Yi
    Deng, Gelei
    Li, Yuekang
    Picek, Stjepan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 7432 - 7449
  • [30] Trend Analysis Through Large Language Models
    Alzapiedi, Lucas
    Bihl, Trevor
    IEEE NATIONAL AEROSPACE AND ELECTRONICS CONFERENCE, NAECON 2024, 2024, : 370 - 374