isiZulu Word Embeddings

Cited by: 0
Authors
Dlamini, Sibonelo [1]
Jembere, Edgar [1]
Pillay, Anban [1]
van Niekerk, Brett [1]
Affiliations
[1] University of KwaZulu-Natal, Department of Computer Science, Durban, South Africa
Source
2021 CONFERENCE ON INFORMATION COMMUNICATIONS TECHNOLOGY AND SOCIETY (ICTAS), 2021
Keywords
isiZulu; word embeddings; semantic relatedness; agglutinative language; subword embeddings
DOI
10.1109/ICTAS50802.2021.9395011
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Word embeddings are currently the most popular vector space model in Natural Language Processing. How we encode words is important because it affects the performance of many downstream tasks, such as Machine Translation (MT), Information Retrieval (IR) and Automatic Speech Recognition (ASR). While much focus has been placed on constructing word embeddings for English, very little attention has been paid to under-resourced languages, especially native African languages. In this paper we select four popular word embedding models (Word2Vec CBOW and Skip-Gram, FastText, and GloVe) and train them on the 10-million-token isiZulu National Corpus (INC) to create isiZulu word embeddings. To the best of our knowledge, this is the first time that isiZulu word embeddings have been constructed and made available to the public. We also create a semantic similarity data set analogous to WordSim353, which we likewise make publicly available. This data set is used to evaluate the four models and determine which is best for creating isiZulu word embeddings in a low-resource (small-corpus) setting. We found that the Word2Vec Skip-Gram model produced the highest-quality embeddings, as measured by this semantic similarity task; however, the GloVe model performed best on the nearest neighbours task.
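The experimental pipeline described in the abstract can be made concrete with a short sketch. The following is a minimal example using gensim and SciPy; the file names (inc_corpus.txt, zulu_wordsim.csv), the hyperparameters, and the query word umfundi are illustrative assumptions, not the authors' actual settings. GloVe has no trainer in gensim and is typically trained with the separate Stanford GloVe toolkit, so only the Word2Vec and FastText models are shown.

```python
# Minimal sketch of the paper's evaluation pipeline (assumed file names,
# hyperparameters, and query word; not the authors' exact configuration).
import csv

from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import LineSentence
from scipy.stats import spearmanr

# Assumed corpus format: one whitespace-tokenised sentence per line.
corpus = LineSentence("inc_corpus.txt")

# Word2Vec Skip-Gram (sg=1) and CBOW (sg=0); FastText adds character
# n-gram (subword) vectors, which matter for an agglutinative language.
skipgram = Word2Vec(corpus, vector_size=200, window=5, sg=1, min_count=5, epochs=10)
cbow = Word2Vec(corpus, vector_size=200, window=5, sg=0, min_count=5, epochs=10)
subword = FastText(corpus, vector_size=200, window=5, sg=1, min_count=5,
                   min_n=3, max_n=6, epochs=10)

def wordsim_spearman(model, path="zulu_wordsim.csv"):
    """Spearman correlation between model cosine similarities and human
    relatedness scores; assumes headerless rows of word1,word2,score and
    skips pairs containing an out-of-vocabulary word."""
    human, predicted = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for w1, w2, score in csv.reader(f):
            if w1 in model.wv and w2 in model.wv:
                human.append(float(score))
                predicted.append(model.wv.similarity(w1, w2))
    return spearmanr(human, predicted).correlation

for name, model in [("skip-gram", skipgram), ("cbow", cbow), ("fasttext", subword)]:
    print(name, wordsim_spearman(model))

# Nearest-neighbour inspection, the paper's second evaluation; 'umfundi'
# ("student") is an example query assumed to occur in the corpus.
print(skipgram.wv.most_similar("umfundi", topn=10))
```

The Spearman correlation over human relatedness judgements mirrors a WordSim353-style evaluation, and most_similar exposes the kind of nearest-neighbours comparison on which the paper reports GloVe performing best.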
Pages: 121-126
Page count: 6
Related Papers
50 items in total
  • [41] Distributed Negative Sampling for Word Embeddings
    Stergiou, Stergios
    Straznickas, Zygimantas
    Wu, Rolina
    Tsioutsiouliklis, Kostas
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017: 2569-2575
  • [42] Turkish entity discovery with word embeddings
    Kalender, Murat
    Korkmaz, Emin Erkan
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2017, 25(3): 2388-2398
  • [43] Cross-Lingual Word Embeddings
    Søgaard, A.
    Vulić, I.
    Ruder, S.
    Faruqui, M.
    Synthesis Lectures on Human Language Technologies, 2019, 12(2): 1-132
  • [44] Joint Multiclass Debiasing of Word Embeddings
    Popovic, Radomir
    Lemmerich, Florian
    Strohmaier, Markus
    FOUNDATIONS OF INTELLIGENT SYSTEMS (ISMIS 2020), 2020, 12117: 79-89
  • [45] A Systematic Literature Review on Word Embeddings
    Gutierrez, Luis
    Keith, Brian
    TRENDS AND APPLICATIONS IN SOFTWARE ENGINEERING (CIMPS 2018), 2019, 865: 132-141
  • [46] Joint Learning of Character and Word Embeddings
    Chen, Xinxiong
    Xu, Lei
    Liu, Zhiyuan
    Sun, Maosong
    Luan, Huanbo
    PROCEEDINGS OF THE TWENTY-FOURTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI), 2015: 1236-1242
  • [47] Relation Reconstructive Binarization of word embeddings
    Pan, Feiyang
    Li, Shuokai
    Ao, Xiang
    He, Qing
    Frontiers of Computer Science, 2022, 16
  • [48] Invariance and identifiability issues for word embeddings
    Carrington, Rachel
    Bharath, Karthik
    Preston, Simon
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [49] Contextualized Word Embeddings in Azerbaijani Language
    Alizada, Tural
    Suleymanov, Umid
    Rustamov, Zaid
    2024 IEEE 18TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT 2024), 2024
  • [50] Faster Parallel Training of Word Embeddings
    Wszola, Eliza
    Jaggi, Martin
    Püschel, Markus
    2021 IEEE 28TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS (HIPC 2021), 2021: 31-41