Data Augmentation Methods for Low-Resource Orthographic Syllabification

Cited by: 3
Authors
Suyanto, Suyanto [1]
Lhaksmana, Kemas M. [1]
Bijaksana, Moch Arif [1]
Kurniawan, Adriana [1]
Affiliations
[1] Telkom Univ, Sch Comp, Bandung 40257, Indonesia
Keywords
Indonesian; flipping onsets; orthographic syllabification; swapping consonant-graphemes; transposing nuclei; LANGUAGE; MODEL
DOI
10.1109/ACCESS.2020.3015778
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
An n-gram syllabification model generally produces a high error rate for a low-resource language, such as Indonesian, because of the high rate of out-of-vocabulary (OOV) n-grams. In this paper, a combination of three data augmentation methods is proposed to address this problem: swapping consonant graphemes, flipping onsets, and transposing nuclei. An investigation of 50k Indonesian words shows that combining the three methods drastically increases the number of both unigrams and bigrams. A previously proposed flipping-onsets procedure has been shown to enhance standard bigram syllabification, relatively decreasing the syllable error rate (SER) by up to 18.02%, while the earlier consonant-grapheme-swapping procedure yields a relative SER reduction of up to 31.39%. In this research, a new augmentation method based on transposing nuclei is proposed and combined with both the flipping and swapping procedures to tackle the weakness of bigram syllabification in handling OOV bigrams. An evaluation based on k-fold cross-validation (k-FCV) with k = 5 on 50 thousand formal Indonesian words concludes that the proposed combination of the three procedures relatively decreases the mean SER of the standard bigram model by up to 37.63%. The proposed model is comparable to the fuzzy k-nearest neighbor in every class (FkNNC)-based model. It is worse than the state-of-the-art model, which combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF), but it offers low complexity.
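As a rough illustration of the three augmentation ideas named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a word is already syllabified into a list of strings and that each syllable splits into onset, nucleus, and coda around its first vowel run; Indonesian digraph graphemes such as `ng` or `ny` are not treated specially here, and applying the operations only to the first two syllables is one hypothetical reading of "flipping" and "transposing".

```python
import re

# Crude onset/nucleus/coda split around the first vowel run.
# NOTE: a simplification -- Indonesian digraph graphemes (ng, ny, kh, sy)
# and diphthongs are not handled specially.
_SYL = re.compile(r"([^aeiou]*)([aeiou]+)([^aeiou]*)$")

def split_syllable(syl):
    """Return (onset, nucleus, coda) of a syllable, or (syl, '', '') if no vowel."""
    m = _SYL.match(syl)
    return m.groups() if m else (syl, "", "")

def flip_onsets(syllables):
    """Exchange the onsets of the first two syllables (one reading of 'flipping onsets')."""
    if len(syllables) < 2:
        return list(syllables)
    o1, n1, c1 = split_syllable(syllables[0])
    o2, n2, c2 = split_syllable(syllables[1])
    return [o2 + n1 + c1, o1 + n2 + c2] + list(syllables[2:])

def transpose_nuclei(syllables):
    """Exchange the nuclei of the first two syllables."""
    if len(syllables) < 2:
        return list(syllables)
    o1, n1, c1 = split_syllable(syllables[0])
    o2, n2, c2 = split_syllable(syllables[1])
    return [o1 + n2 + c1, o2 + n1 + c2] + list(syllables[2:])

def swap_consonant_graphemes(syllables, old, new):
    """Replace one consonant grapheme with another throughout the word."""
    return [s.replace(old, new) for s in syllables]
```

On the syllabified word `["ma", "kan"]`, for instance, `flip_onsets` produces `["ka", "man"]` and `swap_consonant_graphemes(..., "k", "t")` produces `["ma", "tan"]`; each synthetic word keeps valid syllable boundaries, which is how such augmentation can add unseen unigrams and bigrams to the training data.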
Pages: 147399-147406
Number of pages: 8