Probing a pretrained RoBERTa on Khasi language for POS tagging

Cited by: 1
Authors
Mitri, Aiom Minnette [1]
Lyngdoh, Eusebius Lawai [1]
Warjri, Sunita [2]
Saha, Goutam [1]
Lyngdoh, Saralin A. [3]
Maji, Arnab Kumar [3]
Affiliations
[1] North Eastern Hill Univ, Dept Informat Technol, Shillong, Meghalaya, India
[2] Univ South Bohemia Ceske, Fac Fisheries & Water Protect, Budejovicich, Czech Republic
[3] North Eastern Hill Univ, Dept Linguist, Shillong, Meghalaya, India
Source
NATURAL LANGUAGE PROCESSING | 2025 / Vol. 31 / Issue 02
Keywords
Part of speech tagging; tagging; RoBERTa;
DOI
10.1017/nlp.2024.24
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Part-of-speech (POS) tagging, though considered a preliminary step for any Natural Language Processing (NLP) task, is crucial, especially for a low-resource language like Khasi, which lacks any formal corpus. POS tagging is context sensitive, which makes the task challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model called Robustly Optimized BERT Pretraining Approach (RoBERTa) is pretrained on the language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for downstream POS tagging. The existing tagset custom-designed for the Khasi language is employed for this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags for this existing tagset to meet the requirements of the language, and we have increased the size of the existing Khasi POS corpus. Other machine learning and deep learning models have also been tested on the same task, and a comparative analysis of the various models is presented. Two different setups have been used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. The comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.
Pages: 230-249
Number of pages: 20