Probing a pretrained RoBERTa on Khasi language for POS tagging

Cited by: 1
Authors
Mitri, Aiom Minnette [1]
Lyngdoh, Eusebius Lawai [1]
Warjri, Sunita [2]
Saha, Goutam [1]
Lyngdoh, Saralin A. [3]
Maji, Arnab Kumar [3]
Affiliations
[1] North Eastern Hill Univ, Dept Informat Technol, Shillong, Meghalaya, India
[2] Univ South Bohemia Ceske, Fac Fisheries & Water Protect, Budejovicich, Czech Republic
[3] North Eastern Hill Univ, Dept Linguist, Shillong, Meghalaya, India
Source
NATURAL LANGUAGE PROCESSING | 2025 / Vol. 31 / Issue 02
Keywords
Part of speech tagging; tagging; RoBERTa
DOI
10.1017/nlp.2024.24
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Part-of-speech (POS) tagging, though often regarded as a preliminary step for other Natural Language Processing (NLP) tasks, is crucial to address, especially in a low-resource language like Khasi, which lacks any formal corpus. POS tagging is context sensitive, which makes the task challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model, the Robustly Optimized BERT Pretraining Approach (RoBERTa), is pretrained on a language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for downstream POS tagging. The existing tagset custom-designed for the Khasi language is employed for this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags for this existing tagset to meet the requirements of the language, and we increase the size of the existing Khasi POS corpus. Other machine learning and deep learning models are also tested on the same task, and a comparative analysis of the models is presented. Two different setups are used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. The comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.
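As a rough illustration of the fine-tuning setup the abstract describes, the sketch below shows one common way to line up word-level POS tags with subword pieces and to score token-level accuracy. It is not taken from the paper: the fixed-length splitter `toy_subword_split`, the tag names `TAG_A` and `TAG_B`, and the `-100` ignore value are all assumptions made for illustration, and a real setup would use the tokenizer of the pretrained model instead.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: -100 marks positions the loss and the metric should skip,
# a convention commonly used in token-classification fine-tuning.
IGNORE_INDEX = -100


def toy_subword_split(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def align_labels(
    words: Sequence[str], tags: Sequence[str], tag2id: Dict[str, int]
) -> Tuple[List[str], List[int]]:
    """Label the first piece of each word; mask the remaining pieces."""
    pieces: List[str] = []
    label_ids: List[int] = []
    for word, tag in zip(words, tags):
        sub = toy_subword_split(word)
        if not sub:  # skip empty words so pieces and labels stay aligned
            continue
        pieces.extend(sub)
        label_ids.append(tag2id[tag])
        label_ids.extend([IGNORE_INDEX] * (len(sub) - 1))
    return pieces, label_ids


def token_accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of non-masked positions where the prediction matches gold."""
    scored = [(g, p) for g, p in zip(gold, pred) if g != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(1 for g, p in scored if g == p) / len(scored)


# Placeholder words and tags, not real Khasi data or the paper's tagset.
tag2id = {"TAG_A": 0, "TAG_B": 1}
pieces, labels = align_labels(["abcdefg", "hi"], ["TAG_A", "TAG_B"], tag2id)
```

Masking the non-initial pieces means each word contributes exactly one scored position, so the accuracy stays a per-word figure regardless of how finely the tokenizer splits rare words.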
Pages: 230-249
Page count: 20