Probing a pretrained RoBERTa on Khasi language for POS tagging

Cited by: 1
Authors
Mitri, Aiom Minnette [1]
Lyngdoh, Eusebius Lawai [1]
Warjri, Sunita [2]
Saha, Goutam [1]
Lyngdoh, Saralin A. [3]
Maji, Arnab Kumar [3]
Affiliations
[1] North Eastern Hill Univ, Dept Informat Technol, Shillong, Meghalaya, India
[2] Univ South Bohemia Ceske, Fac Fisheries & Water Protect, Budejovicich, Czech Republic
[3] North Eastern Hill Univ, Dept Linguist, Shillong, Meghalaya, India
Source
NATURAL LANGUAGE PROCESSING | 2025 / Vol. 31 / Issue 02
Keywords
Part of speech tagging; tagging; RoBERTa;
DOI
10.1017/nlp.2024.24
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Part of speech (POS) tagging, though considered preliminary to any Natural Language Processing (NLP) task, is crucial, especially for a low-resource language like Khasi that lacks any formal corpus. Because POS tagging is context sensitive, the task is challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model called Robustly Optimized BERT Pretraining Approach (RoBERTa) is pretrained for the language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for downstream POS tagging. The existing tagset, custom-designed for the Khasi language, is employed for this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags for this existing tagset to meet the requirements of the language and have increased the size of the existing Khasi POS corpus. Other machine learning and deep learning models have also been tested on the same task, and a comparative analysis is made of the various models employed. Two different setups have been used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. Comparative analysis of RoPOS with the other models indicates that RoPOS outperforms the others when used for inference on texts outside the domain of the POS-tagged training dataset.
Pages: 230-249
Page count: 20