Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

Authors
Samir, Farhan [1]
Silfverberg, Miikka [1]
Affiliations
[1] University of British Columbia, Natural Language Processing Group, Vancouver, BC, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy STEMCORRUPT (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that generates synthetic examples by randomly substituting stem characters in gold standard training examples. To begin, we conduct an information-theoretic analysis, arguing that STEMCORRUPT improves compositional generalization by eliminating spurious correlations between morphemes, specifically between the stem and the affixes. Our theoretical analysis further leads us to study the sample-efficiency with which STEMCORRUPT reduces these spurious correlations. Through evaluation across seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data-efficiency of STEMCORRUPT. However, we also explore the impact of typological features on the choice of the data selection strategy and find that languages incorporating a high degree of allomorphy and phonological alternations derive less benefit from synthetic examples with high uncertainty. We attribute this effect to phonotactic violations induced by STEMCORRUPT, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
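The substitution step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the stem is approximated as the longest substring shared by the lemma and the inflected form, each stem character is replaced independently with probability `p`, and the names `find_stem` and `stem_corrupt` are hypothetical.

```python
import random
from difflib import SequenceMatcher


def find_stem(lemma, form):
    """Return (lemma_start, form_start, length) of the longest shared substring.

    Assumption for this sketch: the longest common substring of the lemma
    and the inflected form is treated as the stem.
    """
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    match = matcher.find_longest_match(0, len(lemma), 0, len(form))
    return match.a, match.b, match.size


def stem_corrupt(lemma, form, alphabet, rng, p=0.5):
    """Build one synthetic (lemma, form) pair by corrupting the shared stem.

    Each stem character is replaced, with probability p, by a character
    drawn uniformly from `alphabet`. The same corrupted stem is written
    into both strings, so the affix material is left untouched.
    """
    i, j, n = find_stem(lemma, form)
    if n == 0:
        # No shared material: return the pair unchanged.
        return lemma, form
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < p else ch
        for ch in lemma[i:i + n]
    )
    new_lemma = lemma[:i] + new_stem + lemma[i + n:]
    new_form = form[:j] + new_stem + form[j + n:]
    return new_lemma, new_form


if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # The morphological tag of the gold example would be copied unchanged.
    print(stem_corrupt("walk", "walked", alphabet, rng))
```

Because substitutions are drawn uniformly from the alphabet, the sketch can produce stems that violate a language's phonotactics, which is the kind of side effect the abstract points to for languages with extensive allomorphy and phonological alternations.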
Pages: 277-291
Page count: 15
Related papers
Showing 10 of 50
  • [1] Exploring Neural Architectures And Techniques For Typologically Diverse Morphological Inflection
    Jayarao, Pratik
    Pillay, Siddhanth
    Thombre, Pranav
    Chaudhary, Aditi
    17TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY (SIGMORPHON 2020), 2020, : 128 - 136
  • [2] SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
    Vylomova, Ekaterina
    White, Jennifer
    Salesky, Elizabeth
    Mielke, Sabrina J.
    Wu, Shijie
    Ponti, Edoardo
    Maudslay, Rowan Hall
    Zmigrod, Ran
    Valvoda, Josef
    Toldova, Svetlana
    Tyers, Francis
    Klyachko, Elena
    Yegorov, Ilya
    Krizhanovsky, Natalia
    Czarnowska, Paula
    Nikkarinen, Irene
    Krizhanovsky, Andrew
    Pimentel, Tiago
    Hennigen, Lucas Torroba
    Kirov, Christo
    Nicolai, Garrett
    Williams, Adina
    Anastasopoulos, Antonios
    Cruz, Hilaria
    Chodroff, Eleanor
    Cotterell, Ryan
    Silfverberg, Miikka
    Hulden, Mans
    17TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY (SIGMORPHON 2020), 2020, : 1 - 39
  • [3] The UniMelb Submission to the SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
    Shcherbakov, Andrei
    17TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY (SIGMORPHON 2020), 2020, : 177 - 183
  • [4] SIGMORPHON-UniMorph 2023 Shared Task 0: Typologically Diverse Morphological Inflection
    Goldman, Omer
    Batsuren, Khuyagbaatar
    Khalifa, Salam
    Arora, Aryaman
    Nicolai, Garrett
    Tsarfaty, Reut
    Vylomova, Ekaterina
    Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2023, : 117 - 125
  • [5] University of Illinois Submission to the SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
    Canby, Marc E.
    Karipbayeva, Aidana
    Lunt, Bryan J.
    Mozaffari, Sahand
    Yoder, Charlotte R.
    Hockenmaier, Julia
    17TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY (SIGMORPHON 2020), 2020, : 137 - 145
  • [6] SIGMORPHON-UniMorph 2022 Shared Task 0: Generalization and Typologically Diverse Morphological Inflection
    Kodner, Jordan
    Khalifa, Salam
    Batsuren, Khuyagbaatar
    Dolatian, Hossep
    Cotterell, Ryan
    Akkuş, Faruk
    Anastasopoulos, Antonios
    Andrushko, Taras
    Arora, Aryaman
    Bella, Nona Atanelov Gábor
    Budianskaya, Elena
    Ghanggo Ate, Yustinus
    Goldman, Omer
    Guriel, David
    Guriel, Simon
    Guriel-Agiashvili, Silvia
    Kieraś, Witold
    Krizhanovsky, Andrew
    Krizhanovsky, Natalia
    Marchenko, Igor
    Markowska, Magdalena
    Mashkovtseva, Polina
    Nepomniashchaya, Maria
    Rodionova, Daria
    Sheifer, Karina
    Serova, Alexandra
    Yemelina, Anastasia
    Young, Jeremiah
    Vylomova, Ekaterina
    SIGMORPHON 2022 - 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Proceedings of the Workshop, 2022, : 176 - 203
  • [7] Rapid Development of Morphological Analyzers for Typologically Diverse Languages
    Kulick, Seth
    Bies, Ann
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 2551 - 2557
  • [8] Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation
    Sorokin, Alexey
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3978 - 3983
  • [9] TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding
    Zhang, Le
    Yang, Zichao
    Yang, Diyi
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5243 - 5258
  • [10] Cross-lingual Inflection as a Data Augmentation Method for Parsing
    Munoz-Ortiz, Alberto
    Gomez-Rodriguez, Carlos
    Vilares, David
    PROCEEDINGS OF THE THIRD WORKSHOP ON INSIGHTS FROM NEGATIVE RESULTS IN NLP (INSIGHTS 2022), 2022, : 54 - 61