Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

Authors
Samir, Farhan [1]
Silfverberg, Miikka [1]
Affiliations
[1] Univ British Columbia, Nat Language Proc Grp, Vancouver, BC, Canada
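Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract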
Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy STEMCORRUPT (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that generates synthetic examples by randomly substituting stem characters in gold standard training examples. To begin, we conduct an information-theoretic analysis, arguing that STEMCORRUPT improves compositional generalization by eliminating spurious correlations between morphemes, specifically between the stem and the affixes. Our theoretical analysis further leads us to study the sample-efficiency with which STEMCORRUPT reduces these spurious correlations. Through evaluation across seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data-efficiency of STEMCORRUPT. However, we also explore the impact of typological features on the choice of the data selection strategy and find that languages incorporating a high degree of allomorphy and phonological alternations derive less benefit from synthetic examples with high uncertainty. We attribute this effect to phonotactic violations induced by STEMCORRUPT, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
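The substitution step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the stem can be approximated by the longest substring shared by the lemma and the inflected form, and that each stem character is replaced independently with probability `p` by a character drawn uniformly from a supplied alphabet. The function names `longest_common_block` and `stem_corrupt`, the parameter `p`, and the sample alphabet are hypothetical choices made for illustration only.

```python
import random
from difflib import SequenceMatcher


def longest_common_block(lemma, form):
    """Return (start_in_lemma, start_in_form, length) of the longest shared substring."""
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    match = matcher.find_longest_match(0, len(lemma), 0, len(form))
    return match.a, match.b, match.size


def stem_corrupt(lemma, form, alphabet, p=0.5, rng=None):
    """Build one synthetic (lemma, form) pair by corrupting the shared stem.

    Assumption: the stem is the longest common substring of lemma and form.
    Each stem character is swapped for a random alphabet character with
    probability p; the same corrupted stem is written back into both strings,
    so the affixes are left untouched.
    """
    rng = rng or random.Random()
    i, j, size = longest_common_block(lemma, form)
    stem = lemma[i:i + size]
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < p else ch for ch in stem
    )
    new_lemma = lemma[:i] + new_stem + lemma[i + size:]
    new_form = form[:j] + new_stem + form[j + size:]
    return new_lemma, new_form


if __name__ == "__main__":
    # Hypothetical English example; the output varies with the random seed.
    print(stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz",
                       rng=random.Random(0)))
```

Because the replacement characters are drawn without regard to the surrounding sounds, a sketch like this can easily produce strings that violate a language's phonotactics, which is the effect the abstract points to for languages with extensive allomorphy and phonological alternations.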
Pages: 277-291
Number of pages: 15