Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

Authors
Samir, Farhan [1]
Silfverberg, Miikka [1]
Affiliations
[1] Univ British Columbia, Nat Language Proc Grp, Vancouver, BC, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy STEMCORRUPT (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that generates synthetic examples by randomly substituting stem characters in gold standard training examples. To begin, we conduct an information-theoretic analysis, arguing that STEMCORRUPT improves compositional generalization by eliminating spurious correlations between morphemes, specifically between the stem and the affixes. Our theoretical analysis further leads us to study the sample-efficiency with which STEMCORRUPT reduces these spurious correlations. Through evaluation across seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data-efficiency of STEMCORRUPT. However, we also explore the impact of typological features on the choice of the data selection strategy and find that languages incorporating a high degree of allomorphy and phonological alternations derive less benefit from synthetic examples with high uncertainty. We attribute this effect to phonotactic violations induced by STEMCORRUPT, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
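The abstract describes STEMCORRUPT only as randomly substituting stem characters in gold-standard training examples. The following Python sketch illustrates that idea under stated assumptions and is not the authors' implementation: the function name `stem_corrupt`, the substitution probability `p`, and the use of the longest common substring of lemma and inflected form as a stand-in for the stem are all hypothetical choices made for illustration.

```python
import random
from difflib import SequenceMatcher


def stem_corrupt(lemma, form, alphabet, p=0.5, rng=None):
    """Return a synthetic (lemma, form) pair with stem characters resampled.

    Hypothetical sketch: the "stem" is approximated by the longest
    substring shared by the lemma and the inflected form.
    """
    rng = rng or random.Random()
    # Assumption: the longest common substring stands in for the stem.
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    match = matcher.find_longest_match(0, len(lemma), 0, len(form))
    stem = lemma[match.a:match.a + match.size]
    # Resample each stem character with probability p.
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < p else ch for ch in stem
    )
    # Apply the same corrupted stem to both sides so the affixes stay intact.
    new_lemma = lemma[:match.a] + new_stem + lemma[match.a + match.size:]
    new_form = form[:match.b] + new_stem + form[match.b + match.size:]
    return new_lemma, new_form


if __name__ == "__main__":
    print(stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz",
                       rng=random.Random(0)))
```

Because the same replacement stem is written into both the lemma and the inflected form, the affix material is left untouched. This is the property the abstract appeals to when it argues that the method weakens spurious stem-affix correlations. Uniform sampling from the alphabet can also produce phonotactically implausible stems, which matches the limitation the abstract notes.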
Pages: 277-291
Number of pages: 15