Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

Cited by: 0
Authors
Samir, Farhan [1]
Silfverberg, Miikka [1]
Affiliations
[1] Univ British Columbia, Nat Language Proc Grp, Vancouver, BC, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy STEMCORRUPT (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that generates synthetic examples by randomly substituting stem characters in gold standard training examples. To begin, we conduct an information-theoretic analysis, arguing that STEMCORRUPT improves compositional generalization by eliminating spurious correlations between morphemes, specifically between the stem and the affixes. Our theoretical analysis further leads us to study the sample-efficiency with which STEMCORRUPT reduces these spurious correlations. Through evaluation across seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data-efficiency of STEMCORRUPT. However, we also explore the impact of typological features on the choice of the data selection strategy and find that languages incorporating a high degree of allomorphy and phonological alternations derive less benefit from synthetic examples with high uncertainty. We attribute this effect to phonotactic violations induced by STEMCORRUPT, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
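The stem-substitution idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the stem can be approximated by the longest substring shared by the lemma and the inflected form, and that each stem character is replaced independently with probability `p` by a character drawn uniformly from `alphabet`. The function name `stem_corrupt` and all of its parameters are hypothetical.

```python
import random
from difflib import SequenceMatcher


def stem_corrupt(lemma, form, alphabet, rng, p=0.5):
    """Return a synthetic (lemma, form) pair with stem characters substituted.

    Assumption: the stem is approximated by the longest substring that the
    lemma and the inflected form have in common.
    """
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    m = matcher.find_longest_match(0, len(lemma), 0, len(form))
    stem = lemma[m.a:m.a + m.size]

    # Replace each stem character with probability p (hypothetical parameter).
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < p else ch for ch in stem
    )

    # Splice the same corrupted stem into both strings; affixes stay untouched.
    new_lemma = lemma[:m.a] + new_stem + lemma[m.a + m.size:]
    new_form = form[:m.b] + new_stem + form[m.b + m.size:]
    return new_lemma, new_form


if __name__ == "__main__":
    rng = random.Random(0)
    print(stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz", rng))
```

Because the same corrupted stem is written into both the lemma and the inflected form, the affix material is preserved while the stem varies. This is the property the abstract credits with weakening spurious stem-affix correlations. Random substitution can also yield character sequences that a language's phonotactics would not allow, which is the limitation the abstract points to.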
Pages: 277-291
Page count: 15