Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

Cited by: 0

Authors:

Samir, Farhan [1]

Silfverberg, Miikka [1]

Affiliations:

[1] University of British Columbia, Natural Language Processing Group, Vancouver, BC, Canada

Source:

2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023 | 2023

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy STEMCORRUPT (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that generates synthetic examples by randomly substituting stem characters in gold standard training examples. To begin, we conduct an information-theoretic analysis, arguing that STEMCORRUPT improves compositional generalization by eliminating spurious correlations between morphemes, specifically between the stem and the affixes. Our theoretical analysis further leads us to study the sample-efficiency with which STEMCORRUPT reduces these spurious correlations. Through evaluation across seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data-efficiency of STEMCORRUPT. However, we also explore the impact of typological features on the choice of the data selection strategy and find that languages incorporating a high degree of allomorphy and phonological alternations derive less benefit from synthetic examples with high uncertainty. We attribute this effect to phonotactic violations induced by STEMCORRUPT, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
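The abstract describes STEMCORRUPT only at a high level: synthetic examples are produced by randomly substituting stem characters in gold-standard training pairs. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes, purely for illustration, that the stem can be approximated by the longest substring shared by the lemma and the inflected form; the function names, the `sub_prob` parameter, and the `alphabet` argument are hypothetical.

```python
import random
from difflib import SequenceMatcher


def longest_common_substring(lemma, form):
    """Return (start_in_lemma, start_in_form, length) of the longest shared substring."""
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    match = matcher.find_longest_match(0, len(lemma), 0, len(form))
    return match.a, match.b, match.size


def stem_corrupt(lemma, form, alphabet, rng, sub_prob=0.5):
    """Create one synthetic (lemma, form) pair by corrupting the approximate stem.

    Assumption (not from the source): the stem is the longest common substring
    of the lemma and the inflected form. Each stem character is replaced by a
    random character from `alphabet` with probability `sub_prob`, and the same
    corrupted stem is written into both strings so the affixes stay intact.
    """
    i, j, size = longest_common_substring(lemma, form)
    stem = lemma[i:i + size]
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < sub_prob else ch for ch in stem
    )
    new_lemma = lemma[:i] + new_stem + lemma[i + size:]
    new_form = form[:j] + new_stem + form[j + size:]
    return new_lemma, new_form


# Usage: the suffix "ed" is preserved while the stem characters may change.
rng = random.Random(0)
print(stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz", rng))
```

Because substitutions are drawn uniformly from the alphabet in this sketch, the corrupted stem can contain character sequences that are not well formed in the language, which is the kind of phonotactic violation the abstract points to.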


Pages: 277-291

Page count: 15
