Data Augmentation Methods for Low-Resource Orthographic Syllabification

Cited by: 3
Authors
Suyanto, Suyanto [1]
Lhaksmana, Kemas M. [1]
Bijaksana, Moch Arif [1]
Kurniawan, Adriana [1]
Affiliations
[1] Telkom Univ, Sch Comp, Bandung 40257, Indonesia
Keywords
Indonesian; flipping onsets; orthographic syllabification; swapping consonant-graphemes; transposing nuclei; LANGUAGE; MODEL
DOI
10.1109/ACCESS.2020.3015778
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
An n-gram syllabification model generally produces a high error rate for a low-resource language, such as Indonesian, because of the high rate of out-of-vocabulary (OOV) n-grams. In this paper, a combination of three data augmentation methods is proposed to address this problem: swapping consonant graphemes, flipping onsets, and transposing nuclei. An investigation of 50k Indonesian words shows that combining the three methods drastically increases the number of both unigrams and bigrams. A previously proposed flipping-onsets procedure has been shown to enhance standard bigram syllabification, relatively decreasing the syllable error rate (SER) by up to 18.02%, while the earlier consonant-grapheme-swapping procedure yields a relative SER reduction of up to 31.39%. In this research, a new augmentation method based on transposing nuclei is proposed and combined with both the flipping and swapping procedures to tackle the weakness of bigram syllabification in handling OOV bigrams. An evaluation based on k-fold cross-validation (k-FCV) with k = 5 on 50 thousand formal Indonesian words concludes that the proposed combination of the three procedures relatively decreases the mean SER of the standard bigram model by up to 37.63%. The proposed model is comparable to the fuzzy k-nearest neighbor in every class (FkNNC)-based model. It is worse than the state-of-the-art model, which combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF), but it offers low complexity.
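As a rough illustration of the three augmentation ideas named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a word is already syllabified into a list of strings and that each syllable splits into onset, nucleus, and coda around its first vowel run; Indonesian digraph graphemes such as `ng` or `ny` are not treated specially here, and applying the operations only to the first two syllables is one hypothetical reading of "flipping" and "transposing".

```python
import re

# Crude onset/nucleus/coda split around the first vowel run.
# NOTE: a simplification -- Indonesian digraph graphemes (ng, ny, kh, sy)
# and diphthongs are not handled specially.
_SYL = re.compile(r"([^aeiou]*)([aeiou]+)([^aeiou]*)$")

def split_syllable(syl):
    """Return (onset, nucleus, coda) of a syllable, or (syl, '', '') if no vowel."""
    m = _SYL.match(syl)
    return m.groups() if m else (syl, "", "")

def flip_onsets(syllables):
    """Exchange the onsets of the first two syllables (one reading of 'flipping onsets')."""
    if len(syllables) < 2:
        return list(syllables)
    o1, n1, c1 = split_syllable(syllables[0])
    o2, n2, c2 = split_syllable(syllables[1])
    return [o2 + n1 + c1, o1 + n2 + c2] + list(syllables[2:])

def transpose_nuclei(syllables):
    """Exchange the nuclei of the first two syllables."""
    if len(syllables) < 2:
        return list(syllables)
    o1, n1, c1 = split_syllable(syllables[0])
    o2, n2, c2 = split_syllable(syllables[1])
    return [o1 + n2 + c1, o2 + n1 + c2] + list(syllables[2:])

def swap_consonant_graphemes(syllables, old, new):
    """Replace one consonant grapheme with another throughout the word."""
    return [s.replace(old, new) for s in syllables]
```

On the syllabified word `["ma", "kan"]`, for instance, `flip_onsets` produces `["ka", "man"]` and `swap_consonant_graphemes(..., "k", "t")` produces `["ma", "tan"]`; each synthetic word keeps valid syllable boundaries, which is how such augmentation can add unseen unigrams and bigrams to the training data.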
Pages: 147399-147406
Number of pages: 8