Morphologically Motivated Input Variations and Data Augmentation in Turkish-English Neural Machine Translation

Authors
Yirmibesoglu, Zeynep [1]
Gungor, Tunga [1]
Affiliations
[1] Bogazici Univ, Comp Engn, Istanbul, Turkiye
Keywords
Neural machine translation; morphology; low-resource; Transformer; encoder-decoder; attention; data augmentation; word segmentation
DOI
10.1145/3571073
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The success of neural networks in natural language processing has paved the way for neural machine translation (NMT), which rapidly became the mainstream approach in machine translation. Significant improvements in translation performance have been achieved with breakthroughs such as encoder-decoder networks, the attention mechanism, and the Transformer architecture. However, the need for large amounts of parallel data to train an NMT system and the presence of rare words in translation corpora remain open problems. In this article, we address NMT for the low-resource Turkish-English language pair. We employ state-of-the-art NMT architectures and data augmentation methods that exploit monolingual corpora. We point out the importance of input representation for the morphologically rich Turkish language and provide a comprehensive analysis of linguistically and non-linguistically motivated input segmentation approaches. We demonstrate the effectiveness of morphologically motivated input segmentation for Turkish. Moreover, we show the superiority of the Transformer architecture over attentional encoder-decoder models for the Turkish-English language pair. Among the data augmentation approaches employed, we observe back-translation to be the most effective, and we confirm that increasing the amount of parallel data benefits translation quality. This research presents a comprehensive analysis of NMT architectures with different hyperparameters, data augmentation methods, and input representation techniques, and proposes ways of tackling the low-resource setting of Turkish-English NMT.
Pages: 31