KinyaBERT: a Morphology-aware Kinyarwanda Language Model

Cited by: 0
Authors
Nzeyimana, Antoine [1]
Rubungo, Andre Niyongabo [2]
Affiliations
[1] University of Massachusetts Amherst, Amherst, MA 01003, USA
[2] Universitat Politècnica de Catalunya, Barcelona, Spain
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource, morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score on a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.
Pages: 5347-5363
Page count: 17
Related papers
50 in total
  • [1] Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training
    Liu, Rui
    Hu, Yifan
    Zuo, Haolin
    Luo, Zhaojie
    Wang, Longbiao
    Gao, Guanglai
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1075 - 1087
  • [2] Morphology-Aware Interactive Keypoint Estimation
    Kim, Jinhee
    Kim, Taesung
    Kim, Taewoo
    Choo, Jaegul
    Kim, Dong-Wook
    Ahn, Byungduk
    Song, In-Seok
    Kim, Yoon-Ji
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT III, 2022, 13433 : 675 - 685
  • [3] A Morphology-Aware Network for Morphological Disambiguation
    Yildiz, Eray
    Tirkaz, Caglar
    Sahin, H. Bahadir
    Eren, Mustafa Tolga
    Sonmez, Ozan
    THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, : 2863 - 2869
  • [4] MAAM: A Morphology-Aware Alignment Model for Unsupervised Bilingual Lexicon Induction
    Yang, Pengcheng
    Luo, Fuli
    Chen, Peng
    Liu, Tianyu
    Sun, Xu
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3190 - 3196
  • [5] Morphology-Aware Meta-Embeddings for Tamil
    Krishnan, Arjun
    Ragavan, Seyoon
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 94 - 111
  • [6] Morphology-Aware Spell-Checking Dictionary for Esperanto
    Blahus, Marek
    RASLAN 2009: RECENT ADVANCES IN SLAVONIC NATURAL LANGUAGE PROCESSING, 2009, : 3 - 8
  • [7] Phonology and morphology of the borrowing language: Integration of French loanwords in Kinyarwanda
    Rose, Y
    EXPLORATION OF LEXICONS, 1997, B-20 : 253 - 264
  • [8] The Estonian Reference Corpus: Its Composition and Morphology-aware User Interface
    Kaalep, Heiki-Jaan
    Muischnek, Kadri
    Uiboaed, Kristel
    Veskis, Kaarel
    HUMAN LANGUAGE TECHNOLOGIES - THE BALTIC PERSPECTIVE, 2010, 219 : 143 - 146
  • [9] The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation
    Saleva, Jonne
    Lignos, Constantine
    EACL 2021: THE 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2021, : 164 - 174
  • [10] Morphology-aware multi-source fusion–based intracranial aneurysms rupture prediction
    Chubin Ou
    Caizi Li
    Yi Qian
    Chuan-Zhi Duan
    Weixin Si
    Xin Zhang
    Xifeng Li
    Michael Morgan
    Qi Dou
    Pheng-Ann Heng
    European Radiology, 2022, 32 : 5633 - 5641