Probing a pretrained RoBERTa on Khasi language for POS tagging

Cited by: 1
Authors
Mitri, Aiom Minnette [1]
Lyngdoh, Eusebius Lawai [1]
Warjri, Sunita [2]
Saha, Goutam [1]
Lyngdoh, Saralin A. [3]
Maji, Arnab Kumar [3]
Affiliations
[1] North Eastern Hill Univ, Dept Informat Technol, Shillong, Meghalaya, India
[2] Univ South Bohemia Ceske, Fac Fisheries & Water Protect, Budejovicich, Czech Republic
[3] North Eastern Hill Univ, Dept Linguist, Shillong, Meghalaya, India
Source
NATURAL LANGUAGE PROCESSING | 2025 / Vol. 31 / No. 02
Keywords
Part of speech tagging; tagging; RoBERTa;
DOI
10.1017/nlp.2024.24
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Part-of-speech (POS) tagging, though often considered a preliminary step for any Natural Language Processing (NLP) task, is crucial, especially for a low-resource language like Khasi, which lacks any formal corpus. POS tagging is context sensitive, which makes the task challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model called Robustly Optimized BERT Pretraining Approach (RoBERTa) is pretrained on a language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for the downstream task. The existing tagset custom-designed for the Khasi language is employed for this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags for this tagset to meet the requirements of the language, and we have increased the size of the existing Khasi POS corpus. Other machine learning and deep learning models have also been tested on the same task, and a comparative analysis of the models is presented. Two different setups have been used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. The comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.
Pages: 230-249
Page count: 20