Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation

Authors
Sorokin, Alexey [1]
Affiliations
[1] Moscow MV Lomonosov State Univ, Moscow Inst Phys & Technol, Fac Math & Mech, Leninskie Gory,GSP 1, Moscow, Russia
Keywords
inflection; encoder-decoder; abstract paradigms; language models; data augmentation
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
We investigate the effect of data augmentation on low-resource morphological inflection. We compare two settings: the pure low-resource one, where only 100 annotated word forms are available, and the augmented one, where we use the original training set and 1000 unlabeled word forms to generate 1000 artificial inflected forms. Evaluating on the SIGMORPHON 2018 dataset, we observe that using the better of these two models reduces the error rate of the state-of-the-art model by 6%, while for our baseline model the error reduction is 17%.
Pages: 3978 - 3983
Page count: 6