Leveraging Large Language Models in Low-resourced Language NLP: A spaCy Implementation for Modern Tibetan

Cited by: 0
Authors
Kyogoku, Yuki [1 ]
Erhard, Franz Xaver [1 ]
Engels, James [2 ]
Barnett, Robert [3 ]
Affiliations
[1] Univ Leipzig, Leipzig, Germany
[2] Univ Edinburgh, Edinburgh, Scotland
[3] SOAS Univ London, London, England
Source
REVUE D'ETUDES TIBETAINES | 2025, Issue 74
Keywords
DOI
Not available
Chinese Library Classification
C [Social Sciences, General];
Subject Classification Code
03 ; 0303 ;
Abstract
Large Language Models (LLMs) are transforming the possibilities for developing Natural Language Processing (NLP) tools for low-resource languages. While languages like Modern Tibetan have historically faced significant challenges in computational linguistics due to limited digital resources and annotated datasets, LLMs offer a promising solution. This paper describes how we leveraged Google's Gemini Pro 1.5 to generate training data for developing a basic spaCy language model for Modern Tibetan, focusing particularly on Part-of-Speech (POS) tagging. Combining traditional rule-based approaches with LLM-assisted data annotation, we demonstrate a novel methodology for creating NLP tools for languages with limited computational resources. Our findings contribute to the broader effort to enhance digital accessibility for low-resource languages while offering practical insights for similar projects in computational linguistics.
Pages: 34
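
The abstract outlines a workflow in which LLM-generated token/POS annotations are converted into training data for a spaCy tagger. The snippet below is a minimal, hypothetical sketch of that data flow, not the authors' implementation: the `llm_annotations` list, the placeholder Tibetan tokens and tags, and the use of spaCy's blank multi-language (`xx`) pipeline in place of a registered Tibetan language class are all illustrative assumptions.

```python
# Hypothetical sketch: turn LLM-produced POS annotations into spaCy training
# data and fit a part-of-speech tagger. All names and tokens are placeholders.
import spacy
from spacy.tokens import Doc
from spacy.training import Example

# Pre-tokenised sentences with POS tags, in the shape an LLM prompt might return.
llm_annotations = [
    (["བོད་སྐད་", "ཡག་པོ་", "འདུག"], ["NOUN", "ADJ", "VERB"]),  # placeholder sentence
]

# A blank multi-language pipeline stands in for a registered Tibetan language class.
nlp = spacy.blank("xx")
tagger = nlp.add_pipe("tagger")

# Build Example objects from the pre-tokenised words and their gold tags.
examples = []
for words, tags in llm_annotations:
    doc = Doc(nlp.vocab, words=words)
    examples.append(Example.from_dict(doc, {"tags": tags}))

# Initialise the tagger (labels are inferred from the examples) and train briefly.
optimizer = nlp.initialize(get_examples=lambda: examples)
for _ in range(20):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

# Tag a new pre-tokenised sentence with the trained component.
test_doc = tagger(Doc(nlp.vocab, words=["བོད་སྐད་", "ཡག་པོ་", "འདུག"]))
print([(token.text, token.tag_) for token in test_doc])
```

In a full project the annotated `Doc` objects would more typically be serialised with spaCy's `DocBin` and trained through the `spacy train` CLI with a config file; the in-memory loop above only illustrates how LLM output can be mapped onto spaCy training examples.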