isiZulu Word Embeddings

Cited by: 0
Authors
Dlamini, Sibonelo [1 ]
Jembere, Edgar [1 ]
Pillay, Anban [1 ]
van Niekerk, Brett [1 ]
Affiliations
[1] Univ KwaZulu Natal, Dept Comp Sci, Durban, South Africa
Source
2021 CONFERENCE ON INFORMATION COMMUNICATIONS TECHNOLOGY AND SOCIETY (ICTAS), 2021
Keywords
isiZulu; word embeddings; semantic relatedness; agglutinative language; subword embeddings;
DOI
10.1109/ICTAS50802.2021.9395011
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Word embeddings are currently the most popular vector space model in Natural Language Processing. How we encode words is important because it affects the performance of many downstream tasks, such as Machine Translation (MT), Information Retrieval (IR) and Automatic Speech Recognition (ASR). While much focus has been placed on constructing word embeddings for English, very little attention has been paid to under-resourced languages, especially native African languages. In this paper we select four popular word embedding models (Word2Vec CBOW and Skip-Gram, FastText and GloVe) and train them on the 10-million-token isiZulu National Corpus (INC) to create isiZulu word embeddings. To the best of our knowledge, this is the first time that word embeddings for isiZulu have been constructed and made available to the public. We also create, and make publicly available, a semantic similarity data set analogous to WordSim353. This data set is used to evaluate the four models and determine which is best for creating isiZulu word embeddings in a low-resource (small-corpus) setting. We found that the Word2Vec Skip-Gram model produced the highest-quality embeddings, as measured by the semantic similarity task; however, the GloVe model performed best on the nearest-neighbours task.
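The semantic similarity evaluation described in the abstract (a WordSim353-style benchmark) is typically scored by ranking each model's cosine similarities against human relatedness ratings with Spearman's rank correlation. The sketch below shows that protocol in plain Python; the isiZulu word pairs, human scores, and toy 2-d vectors are invented for illustration and are not from the paper's data set or corpus:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def ranks(xs):
    """Rank values from 1 upward, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den

# Toy 2-d "embeddings" for four isiZulu words (vectors are invented).
embeddings = {
    "inja":   [1.0, 0.1],  # dog
    "ikati":  [0.9, 0.2],  # cat
    "imoto":  [0.1, 1.0],  # car
    "ibhasi": [0.3, 0.9],  # bus
}

# (word1, word2, hypothetical human relatedness score), WordSim353-style.
pairs = [
    ("inja", "ikati", 8.5),
    ("imoto", "ibhasi", 8.0),
    ("inja", "imoto", 1.5),
]

human = [score for _, _, score in pairs]
model = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
rho = spearman(human, model)  # 1.0 here: the model ranks all pairs as humans do
print(f"Spearman rho = {rho:.2f}")
```

For real experiments of this kind, gensim's `KeyedVectors.evaluate_word_pairs` implements the same protocol directly over a tab-separated word-pair file.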
Pages: 121-126
Page count: 6