Leveraging Large Language Models in Low-resourced Language NLP: A spaCy Implementation for Modern Tibetan

Cited by: 0
Authors
Kyogoku, Yuki [1 ]
Erhard, Franz Xaver [1 ]
Engels, James [2 ]
Barnett, Robert [3 ]
Affiliations
[1] Univ Leipzig, Leipzig, Germany
[2] Univ Edinburgh, Edinburgh, Scotland
[3] SOAS Univ London, London, England
Source
REVUE D ETUDES TIBETAINES | 2025, No. 74
Keywords
DOI
None available
Chinese Library Classification (CLC)
C [Social Sciences, General]
Discipline Classification Code
03; 0303
Abstract
Large Language Models (LLMs) are transforming the possibilities for developing Natural Language Processing (NLP) tools for low-resource languages. While languages like Modern Tibetan have historically faced significant challenges in computational linguistics due to limited digital resources and annotated datasets, LLMs offer a promising solution. This paper describes how we leveraged Google's Gemini Pro 1.5 to generate training data for developing a basic spaCy language model for Modern Tibetan, focusing particularly on Part-of-Speech (POS) tagging. Combining traditional rule-based approaches with LLM-assisted data annotation, we demonstrate a novel methodology for creating NLP tools for languages with limited computational resources. Our findings contribute to the broader effort to enhance digital accessibility for low-resource languages while offering practical insights for similar projects in computational linguistics.
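The workflow the abstract describes — having an LLM annotate Tibetan text with POS tags, then using those annotations as spaCy training data — can be illustrated with a minimal sketch. This is not the authors' code: the function names and the `token/TAG` annotation format are hypothetical, and the tagset is assumed to be the Universal POS inventory that spaCy's tagger conventionally uses. A validation step of this kind matters because LLM output can contain malformed pairs or invented tags.

```python
# Hypothetical sketch: validating LLM-produced POS annotations before
# shaping them into the dict format spaCy's Example.from_dict accepts.
# Tag inventory assumed to be the 17 Universal POS tags.
UPOS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"}

def parse_llm_annotation(line):
    """Parse one whitespace-separated 'token/TAG' line returned by the LLM,
    e.g. 'བོད/PROPN སྐད/NOUN །/PUNCT'. Rejects unknown or missing tags."""
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        if not token or tag not in UPOS:
            raise ValueError(f"malformed annotation: {item!r}")
        pairs.append((token, tag))
    return pairs

def to_spacy_example(pairs):
    """Shape validated (token, tag) pairs as the annotation dict that
    spacy.training.Example.from_dict expects for tagger training."""
    return {"words": [w for w, _ in pairs], "tags": [t for _, t in pairs]}
```

Under these assumptions, each validated example would then be paired with a tokenized `Doc` and fed to `nlp.update()` in a standard spaCy training loop; rule-based checks like the tagset filter above are one place the paper's hybrid rule-based/LLM approach could slot in.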
Pages: 34
Related Articles (10 of 50 shown)
  • [1] Performance of Recent Large Language Models for a Low-Resourced Language
    Jayakody, Ravindu
    Dias, Gihan
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 162 - 167
  • [2] An Automatic Summarizer for a Low-Resourced Language
    Pattnaik, Sagarika
    Nayak, Ajit Kumar
    ADVANCED COMPUTING AND INTELLIGENT ENGINEERING, 2020, 1082 : 285 - 295
  • [3] Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
    Nazir, Shahzad
    Asif, Muhammad
    Sahi, Shahbaz Ahmad
    Ahmad, Shahbaz
    Ghadi, Yazeed Yasin
    Aziz, Muhammad Haris
    IEEE ACCESS, 2022, 10 : 54091 - 54097
  • [4] Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya
    Gaim, Fitsum
    Yang, Wonsuk
    Park, Hancheol
    Park, Jong C.
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 11857 - 11870
  • [5] A Spell Checker for a Low-resourced and Morphologically Rich Language
    Octaviano, Manolito, Jr.
    Borra, Allan
    TENCON 2017 - 2017 IEEE REGION 10 CONFERENCE, 2017, : 1853 - 1856
  • [6] Gramatika: A Grammar Checker for the Low-Resourced Filipino Language
    Go, Matthew Phillip
    Nocon, Nicco
    Borra, Allan
    TENCON 2017 - 2017 IEEE REGION 10 CONFERENCE, 2017, : 471 - 475
  • [7] Explainable Pre-Trained Language Models for Sentiment Analysis in Low-Resourced Languages
    Mabokela, Koena Ronny
    Primus, Mpho
    Celik, Turgay
    BIG DATA AND COGNITIVE COMPUTING, 2024, 8 (11)
  • [8] A Need Finding Study with Low-Resourced Language Content Creators
    Nigatu, Hellina Hailu
    Canny, John
    Chasins, Sarah
    PROCEEDINGS OF THE 4TH AFRICAN CONFERENCE FOR HUMAN COMPUTER INTERACTION, AFRICHI 2023, 2023, : 1 - 4
  • [9] A First LVCSR System for Luxembourgish, a Low-Resourced European Language
    Adda-Decker, Martine
    Lamel, Lori
    Adda, Gilles
    Lavergne, Thomas
    HUMAN LANGUAGE TECHNOLOGY CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS, 2014, 8387 : 479 - 490
  • [10] Common latent representation learning for low-resourced spoken language identification
    Chen, Chen
    Bu, Yulin
    Chen, Yong
    Chen, Deyun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (12) : 34515 - 34535