KinyaBERT: a Morphology-aware Kinyarwanda Language Model

Cited by: 0
Authors:
Nzeyimana, Antoine [1 ]
Rubungo, Andre Niyongabo [2 ]
Affiliations:
[1] Univ Massachusetts, Amherst, MA 01003 USA
[2] Univ Politecn Cataluna, Barcelona, Spain
DOI: not available
CLC number: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding - BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score of a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.
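As a concrete illustration of the two-tier idea described in the abstract, the following minimal PyTorch sketch composes each word's morphemes with a small morpheme-level transformer and then contextualizes the resulting word vectors with a sentence-level transformer. All names, dimensions, and the mean-pooling composition step are illustrative assumptions; the actual KinyaBERT design (e.g., how the morphological analyzer's output is injected) differs in detail.

import torch
import torch.nn as nn

class TwoTierEncoder(nn.Module):
    def __init__(self, morph_vocab=10000, morph_dim=128, word_dim=256,
                 morph_layers=1, word_layers=4):
        super().__init__()
        # Tier 1: embed morphemes and encode them within each word.
        self.morph_emb = nn.Embedding(morph_vocab, morph_dim, padding_idx=0)
        m_layer = nn.TransformerEncoderLayer(morph_dim, nhead=4, batch_first=True)
        self.morph_enc = nn.TransformerEncoder(m_layer, num_layers=morph_layers)
        self.proj = nn.Linear(morph_dim, word_dim)
        # Tier 2: contextualize the composed word vectors across the sentence.
        w_layer = nn.TransformerEncoderLayer(word_dim, nhead=8, batch_first=True)
        self.word_enc = nn.TransformerEncoder(w_layer, num_layers=word_layers)

    def forward(self, morph_ids):
        # morph_ids: (batch, words, morphemes) integer morpheme ids, 0 = padding.
        b, w, m = morph_ids.shape
        flat = morph_ids.view(b * w, m)
        pad = flat.eq(0)
        h = self.morph_enc(self.morph_emb(flat), src_key_padding_mask=pad)
        # Mean-pool the non-padding morpheme states into one vector per word.
        keep = (~pad).unsqueeze(-1).float()
        word_vecs = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        word_vecs = self.proj(word_vecs).view(b, w, -1)
        return self.word_enc(word_vecs)  # (batch, words, word_dim)

# Example: 2 sentences, 5 words each, up to 4 morphemes per word.
ids = torch.randint(1, 10000, (2, 5, 4))
print(TwoTierEncoder()(ids).shape)  # torch.Size([2, 5, 256])

The point of the two tiers is that the sentence-level encoder sees one position per word rather than one per sub-word token, so the word-relative syntactic regularities the abstract mentions operate over whole-word representations.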
Pages: 5347 - 5363 (17 pages)
Related papers (50 in total)
  • [31] CxLM: A Construction and Context-aware Language Model
    Tseng, Yu-Hsiang
    Shih, Cing-Fang
    Chen, Pin-Er
    Chou, Hsin-Yu
    Ku, Mao-Chang
    Hsieh, Shu-Kai
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6361 - 6369
  • [32] Entity-Aware Language Model as an Unsupervised Reranker
    Rasooli, Mohammad Sadegh
    Parthasarathy, Sarangarajan
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 406 - 410
  • [33] Morphological Verb-Aware Tibetan Language Model
    Khysru, Kuntharrgyal
    Jin, Di
    Dang, Jianwu
    IEEE ACCESS, 2019, 7 : 72896 - 72904
  • [34] Morphology aware data augmentation with neural language models for online hybrid ASR
    Tarjan, Balazs
    Fegyo, Tibor
    Mihajlik, Peter
    ACTA LINGUISTICA ACADEMICA, 2022, 69 (04): 581 - 598
  • [35] Morphology Model and Segmentation for Old Turkic Language
    Zhanabergenova, Dinara
    Tukeyev, Ualsher
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2021), 2021, 12876 : 629 - 642
  • [36] Time-aware mixed language model for microblog search
    Wei, Bing-Jie
    Wang, Bin
    Jisuanji Xuebao/Chinese Journal of Computers, 2014, 37 (01): 229 - 237
  • [37] Sharpness-Aware Minimization Improves Language Model Generalization
    Bahri, Dara
    Mobahi, Hossein
    Tay, Yi
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7360 - 7371
  • [39] A Discriminative Entity-Aware Language Model for Virtual Assistants
    Saebi, Mandana
    Pusateri, Ernest
    Meghawat, Aaksha
    Van Gysel, Christophe
    INTERSPEECH 2021, 2021, : 2032 - 2036
  • [40] A Context-Aware Language Model for Spoken Query Retrieval
    Zhong, Yapin
    Gilbert, Juan E.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2005, 8 (02) : 203 - 219