A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 1
|
Authors
Qarah, Faisal [1 ]
Alsanoosy, Tawfeeq [1 ]
Affiliations
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, No. 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications;
DOI
10.3390/app14135696
CLC Classification
O6 [Chemistry]
Discipline Code
0703
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis: tokens become the input for other natural language processing (NLP) tasks, such as semantic parsing and language modeling. However, there is a lack of research evaluating the impact of tokenization on Arabic language models. This study therefore aims to address this gap in the literature by evaluating the performance of various tokenizers on Arabic large language models (LLMs). In this paper, we analyze the differences between the WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models, one with each tokenizer, and measuring the performance of each model on seven NLP tasks using 29 different datasets. Overall, the model pretrained on text tokenized with the SentencePiece tokenizer significantly outperforms the two models that use the WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, selecting the most suitable tokenizer, improving feature engineering, and making models more efficient, ultimately leading to advancements in various NLP applications.
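For context (this sketch is not from the paper itself), the tokenizers compared in the abstract differ mainly in how they split words into subword units drawn from a learned vocabulary. WordPiece, for instance, applies greedy longest-match segmentation from left to right, marking non-initial subwords with a `##` prefix. A minimal illustration, assuming a toy hand-written vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match subword segmentation in the style of WordPiece.

    Scans left to right, always taking the longest vocabulary entry that
    matches; non-initial pieces carry the conventional "##" prefix.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark a word-internal continuation
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk]  # no subword matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for demonstration only
vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_tokenize("tokenizers", vocab))  # -> ['token', '##izer', '##s']
```

SentencePiece and byte-level BPE (BBPE) differ from this scheme in that they operate on raw text (or raw bytes) without assuming pre-split words, which is one reason tokenizer choice can matter for a morphologically rich language such as Arabic.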
Pages: 17
Related Papers
50 in total
  • [1] Alyafeai, Zaid; Al-shaibani, Maged S.; Ghaleb, Mustafa; Ahmad, Irfan. Evaluating Various Tokenizers for Arabic Text Classification. NEURAL PROCESSING LETTERS, 2023, 55(03): 2911-2933.
  • [2] Tamer, Ahmed; Hassan, Al-Amir; Ali, Asmaa; Salah, Nada; Medhat, Walaa. Fine Tuning of Large Language Models for Arabic Language. 2023 20TH ACS/IEEE INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, AICCSA, 2023.
  • [3] Qatar Computing Research Institute, HBKU, Qatar. LAraBench: Benchmarking Arabic AI with Large Language Models. arXiv preprint.
  • [4] Xiao, Hanguang; Zhou, Feizhong; Liu, Xingyue; Liu, Tianqi; Li, Zhipeng; Liu, Xin; Huang, Xiaoxuan. A comprehensive survey of large language models and multimodal large models in medicine. INFORMATION FUSION, 2025, 117.
  • [5] Zhang, Chen; D'Haro, Luis Fernando; Chen, Yiming; Zhang, Malu; Li, Haizhou. A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024: 19515-19524.
  • [6] Xie, Qianqian; Chen, Qingyu; Chen, Aokun; Peng, Cheng; Hu, Yan; Lin, Fongci; Peng, Xueqing; Huang, Jimin; Zhang, Jeffrey; Keloth, Vipina; Zhou, Xinyu; Qian, Lingfei; He, Huan; Shung, Dennis; Ohno-Machado, Lucila; Wu, Yonghui; Xu, Hua; Bian, Jiang. Medical foundation large language models for comprehensive text analysis and beyond. NPJ DIGITAL MEDICINE, 2025, 8(01).
  • [7] Jin, Bowen; Liu, Gang; Han, Chi; Jiang, Meng; Ji, Heng; Han, Jiawei. Large Language Models on Graphs: A Comprehensive Survey. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36(12): 8622-8642.
  • [8] Tang, Tianyi; Hui, Yiwen; Li, Bingqian; Lu, Wenyang; Qin, Zijing; Sun, Haoxiang; Wang, Jiapeng; Xu, Shiyi; Cheng, Xiaoxue; Guo, Geyang; Peng, Han; Zheng, Bowen; Tang, Yiru; Min, Yingqian; Chen, Yushuo; Chen, Jie; Zhao, Yuanqian; Ding, Luran; Wang, Yuhao; Dong, Zican; Xia, Chunxuan; Li, Junyi; Zhou, Kun; Zhao, Wayne Xin; Wen, Ji-Rong. LLMBox: A Comprehensive Library for Large Language Models. PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024: 388-399.
  • [9] Kim, Sunkyu; Lee, Choong-kun; Kim, Seung-seob. Large Language Models: A Comprehensive Guide for Radiologists. JOURNAL OF THE KOREAN SOCIETY OF RADIOLOGY, 2024, 85(05): 861-882.