Leveraging Large Language Models in Low-resourced Language NLP: A spaCy Implementation for Modern Tibetan

Cited by: 0
Authors
Kyogoku, Yuki [1 ]
Erhard, Franz Xaver [1 ]
Engels, James [2 ]
Barnett, Robert [3 ]
Affiliations
[1] Univ Leipzig, Leipzig, Germany
[2] Univ Edinburgh, Edinburgh, Scotland
[3] SOAS Univ London, London, England
Source
REVUE D'ETUDES TIBETAINES | 2025, Issue 74
Keywords
DOI
Not available
Chinese Library Classification
C [Social Sciences, General];
Subject Classification Code
03 ; 0303 ;
Abstract
Large Language Models (LLMs) are transforming the possibilities for developing Natural Language Processing (NLP) tools for low-resource languages. While languages like Modern Tibetan have historically faced significant challenges in computational linguistics due to limited digital resources and annotated datasets, LLMs offer a promising solution. This paper describes how we leveraged Google's Gemini Pro 1.5 to generate training data for developing a basic spaCy language model for Modern Tibetan, focusing particularly on Part-of-Speech (POS) tagging. Combining traditional rule-based approaches with LLM-assisted data annotation, we demonstrate a novel methodology for creating NLP tools for languages with limited computational resources. Our findings contribute to the broader effort to enhance digital accessibility for low-resource languages while offering practical insights for similar projects in computational linguistics.
Pages: 34
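
The abstract outlines a workflow in which LLM-generated token/POS annotations are converted into training data for a spaCy tagger. The snippet below is a minimal, hypothetical sketch of that data flow, not the authors' implementation: the `llm_annotations` list, the placeholder Tibetan tokens and tags, and the use of spaCy's blank multi-language (`xx`) pipeline in place of a registered Tibetan language class are all illustrative assumptions.

```python
# Hypothetical sketch: turn LLM-produced POS annotations into spaCy training
# data and fit a part-of-speech tagger. All names and tokens are placeholders.
import spacy
from spacy.tokens import Doc
from spacy.training import Example

# Pre-tokenised sentences with POS tags, in the shape an LLM prompt might return.
llm_annotations = [
    (["བོད་སྐད་", "ཡག་པོ་", "འདུག"], ["NOUN", "ADJ", "VERB"]),  # placeholder sentence
]

# A blank multi-language pipeline stands in for a registered Tibetan language class.
nlp = spacy.blank("xx")
tagger = nlp.add_pipe("tagger")

# Build Example objects from the pre-tokenised words and their gold tags.
examples = []
for words, tags in llm_annotations:
    doc = Doc(nlp.vocab, words=words)
    examples.append(Example.from_dict(doc, {"tags": tags}))

# Initialise the tagger (labels are inferred from the examples) and train briefly.
optimizer = nlp.initialize(get_examples=lambda: examples)
for _ in range(20):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

# Tag a new pre-tokenised sentence with the trained component.
test_doc = tagger(Doc(nlp.vocab, words=["བོད་སྐད་", "ཡག་པོ་", "འདུག"]))
print([(token.text, token.tag_) for token in test_doc])
```

In a full project the annotated `Doc` objects would more typically be serialised with spaCy's `DocBin` and trained through the `spacy train` CLI with a config file; the in-memory loop above only illustrates how LLM output can be mapped onto spaCy training examples.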