Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language

Authors:

Saxena, Shefali [1]

Gupta, Ayush [1]

Daniel, Philemon [1]

Affiliations:

[1] Natl Inst Technol Hamirpur, Dept Elect & Commun Engn, Hamirpur, India

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2024, Vol. 83, Issue 24

Keywords:

Data Augmentation; Low-resource language; Machine Translation; Evaluation

DOI:

10.1007/s11042-023-18086-8

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

With the rapid advancement of AI technology in recent years, many Data Augmentation (DA) approaches have been investigated to increase data efficiency in Natural Language Processing (NLP). NLP models depend on large amounts of data, and labelling enormous amounts of textual data requires substantial time, money, and human resources; this reliance limits the tasks such models can perform well, since a better model requires more data. Text DA techniques address this by expanding the available data, which enhances the model's accuracy and robustness. A novel lexical-based matching approach is the cornerstone of this work; it is used to improve the quality of a Machine Translation (MT) system. The study uses resource-rich Indic languages (i.e., from the Indo-Aryan and Dravidian language families) to examine the proposed techniques. Extensive experiments on a range of language pairs show that, with the augmented dataset, the proposed method significantly improves on the baseline system's BLEU, METEOR and ROUGE evaluation scores.

Pages: 64255-64269

Page count: 15
