A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 1
|
Authors
Qarah, Faisal [1 ]
Alsanoosy, Tawfeeq [1 ]
Affiliations
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, No. 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications;
DOI
10.3390/app14135696
CLC Classification
O6 [Chemistry]
Discipline Code
0703
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis: tokens become the input for other natural language processing (NLP) tasks, such as semantic parsing and language modeling. However, there is a lack of research evaluating the impact of tokenization on Arabic language models. This study therefore aims to address this gap in the literature by evaluating the performance of various tokenizers on Arabic large language models (LLMs). In this paper, we analyze the differences between the WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models, one with each tokenizer, and measuring the performance of each model on seven NLP tasks using 29 different datasets. Overall, the model pretrained on text tokenized with the SentencePiece tokenizer significantly outperforms the two models that use the WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, selecting the most suitable tokenizer, improving feature engineering, and making models more efficient, ultimately leading to advancements in various NLP applications.
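For context (this sketch is not from the paper itself), the tokenizers compared in the abstract differ mainly in how they split words into subword units drawn from a learned vocabulary. WordPiece, for instance, applies greedy longest-match segmentation from left to right, marking non-initial subwords with a `##` prefix. A minimal illustration, assuming a toy hand-written vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match subword segmentation in the style of WordPiece.

    Scans left to right, always taking the longest vocabulary entry that
    matches; non-initial pieces carry the conventional "##" prefix.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark a word-internal continuation
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk]  # no subword matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for demonstration only
vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_tokenize("tokenizers", vocab))  # -> ['token', '##izer', '##s']
```

SentencePiece and byte-level BPE (BBPE) differ from this scheme in that they operate on raw text (or raw bytes) without assuming pre-split words, which is one reason tokenizer choice can matter for a morphologically rich language such as Arabic.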
Pages: 17
Related Papers
50 in total
  • [1] Alyafeai, Zaid; Al-shaibani, Maged S.; Ghaleb, Mustafa; Ahmad, Irfan. Evaluating Various Tokenizers for Arabic Text Classification. NEURAL PROCESSING LETTERS, 2023, 55(03): 2911-2933.
  • [2] Tamer, Ahmed; Hassan, Al-Amir; Ali, Asmaa; Salah, Nada; Medhat, Walaa. Fine Tuning of Large Language Models for Arabic Language. 2023 20TH ACS/IEEE INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, AICCSA, 2023.
  • [3] Qatar Computing Research Institute, HBKU, Qatar. LAraBench: Benchmarking Arabic AI with Large Language Models. arXiv preprint.
  • [4] Xiao, Hanguang; Zhou, Feizhong; Liu, Xingyue; Liu, Tianqi; Li, Zhipeng; Liu, Xin; Huang, Xiaoxuan. A comprehensive survey of large language models and multimodal large models in medicine. INFORMATION FUSION, 2025, 117.
  • [5] Zhang, Chen; D'Haro, Luis Fernando; Chen, Yiming; Zhang, Malu; Li, Haizhou. A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024: 19515-19524.
  • [6] Xie, Qianqian; Chen, Qingyu; Chen, Aokun; Peng, Cheng; Hu, Yan; Lin, Fongci; Peng, Xueqing; Huang, Jimin; Zhang, Jeffrey; Keloth, Vipina; Zhou, Xinyu; Qian, Lingfei; He, Huan; Shung, Dennis; Ohno-Machado, Lucila; Wu, Yonghui; Xu, Hua; Bian, Jiang. Medical foundation large language models for comprehensive text analysis and beyond. NPJ DIGITAL MEDICINE, 2025, 8(01).
  • [7] Jin, Bowen; Liu, Gang; Han, Chi; Jiang, Meng; Ji, Heng; Han, Jiawei. Large Language Models on Graphs: A Comprehensive Survey. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36(12): 8622-8642.
  • [8] Tang, Tianyi; Hui, Yiwen; Li, Bingqian; Lu, Wenyang; Qin, Zijing; Sun, Haoxiang; Wang, Jiapeng; Xu, Shiyi; Cheng, Xiaoxue; Guo, Geyang; Peng, Han; Zheng, Bowen; Tang, Yiru; Min, Yingqian; Chen, Yushuo; Chen, Jie; Zhao, Yuanqian; Ding, Luran; Wang, Yuhao; Dong, Zican; Xia, Chunxuan; Li, Junyi; Zhou, Kun; Zhao, Wayne Xin; Wen, Ji-Rong. LLMBox: A Comprehensive Library for Large Language Models. PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024: 388-399.
  • [9] Kim, Sunkyu; Lee, Choong-kun; Kim, Seung-seob. Large Language Models: A Comprehensive Guide for Radiologists. JOURNAL OF THE KOREAN SOCIETY OF RADIOLOGY, 2024, 85(05): 861-882.