Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language

Cited by: 0
Authors
Saxena, Shefali [1]
Gupta, Ayush [1]
Daniel, Philemon [1]
Affiliations
[1] Natl Inst Technol Hamirpur, Dept Elect & Commun Engn, Hamirpur, India
Keywords
Data Augmentation; Low-resource language; Machine Translation; Evaluation
DOI
10.1007/s11042-023-18086-8
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
With the rapid advancement of AI technology in recent years, many Data Augmentation (DA) approaches have been investigated to increase data efficiency in Natural Language Processing (NLP). NLP models rely on large amounts of data, and tasks such as labelling enormous amounts of textual data require substantial time, money, and human resources; this reliance limits model performance, since a better model requires more data. Text DA techniques address this by extending the available data, enhancing the model's accuracy and resilience. A novel lexical-based matching approach is the cornerstone of this work; it is used to improve the quality of the Machine Translation (MT) system. This study uses resource-rich Indic languages (i.e., from the Indo-Aryan and Dravidian language families) to examine the proposed techniques. Extensive experiments on a range of language pairs show that the proposed method significantly improves BLEU, METEOR and ROUGE scores on the enhanced dataset compared with the baseline system.
Pages: 64255-64269
Page count: 15
Related papers
50 in total
  • [31] Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation
    Pang, Jianhui
    Yang, Baosong
    Wong, Derek Fai
    Wan, Yu
    Liu, Dayiheng
    Chao, Lidia Sam
    Xie, Jun
    COMPUTATIONAL LINGUISTICS, 2023, 50 (01) : 25 - 47
  • [32] Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
    Sahin, Goezde Guel
    Steedman, Mark
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 5004 - 5009
  • [33] Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation
    Sorokin, Alexey
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3978 - 3983
  • [34] Adding Visual Information to Improve Multimodal Machine Translation for Low-Resource Language
    Shi, Xiayang
    Yu, Zhenqiang
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
  • [35] An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation: An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation
    Thi-Vinh Ngo
    Phuong-Thai Nguyen
    Van Vinh Nguyen
    Thanh-Le Ha
    Le-Minh Nguyen
    APPLIED ARTIFICIAL INTELLIGENCE, 2022, 36 (01)
  • [36] MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER
    Zhou, Ran
    Li, Xin
    He, Ruidan
    Bing, Lidong
    Cambria, Erik
    Si, Luo
    Miao, Chunyan
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 2251 - 2262
  • [37] Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language
    Jiang, Lanlan
    Qin, Xingguo
    Zhang, Jingwei
    Li, Jun
    APPLIED SCIENCES-BASEL, 2024, 14 (20)
  • [38] Statistical Machine Translation for Bilingually Low-Resource Scenarios: A Round-Tripping Approach
    Ahmadnia, Benyamin
    Haffari, Gholamreza
    Serrano, Javier
    2018 IEEE 5TH INTERNATIONAL CONGRESS ON INFORMATION SCIENCE AND TECHNOLOGY (IEEE CIST'18), 2018, : 261 - 265
  • [39] Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages
    Hai-Long Trieu
    Duc-Vu Tran
    Ittoo, Ashwin
    Le-Minh Nguyen
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2019, 18 (03)
  • [40] Extracting Bilingual Multi-word Expressions for Low-resource Statistical Machine Translation
    Wei, Linyu
    Li, Miao
    Chen, Lei
    Yang, Zhenxin
    Sun, Kai
    Yuan, Man
    PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 21 - 24