Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language

Cited by: 0

Authors:

Jiang, Lanlan [1]

Qin, Xingguo [2]

Zhang, Jingwei [2]

Li, Jun [2]

Affiliations:

[1] Guilin Univ Elect Technol, Sch Business, Guilin 541004, Peoples R China

[2] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Guilin 541004, Peoples R China

Source:

APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 20

Funding:

National Natural Science Foundation of China;

Keywords:

seed data augmentation; low-resource data; Latin Cuengh language; multimodal;

DOI:

10.3390/app14209533

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703;

Abstract:

Latin Cuengh is a low-resource dialect that is prevalent in select ethnic minority regions in China. This language presents unique challenges for intelligent research and preservation efforts, primarily due to its oral tradition and the limited availability of textual resources. Prior research has sought to bolster intelligent processing capabilities with regard to Latin Cuengh through data augmentation techniques leveraging scarce textual data, with modest success. In this study, we introduce an innovative multimodal seed data augmentation model designed to significantly enhance the intelligent recognition and comprehension of this dialect. After supplementing the pre-trained model with extensive speech data, we fine-tune its performance with a modest corpus of multilingual textual seed data, employing both Latin Cuengh and Chinese texts as bilingual seed data to enrich its multilingual properties. We then refine its parameters through a variety of downstream tasks. The proposed model achieves a commendable performance across both multi-classification and binary classification tasks, with its average accuracy and F1 measure increasing by more than 3%. Moreover, the model's training efficiency is substantially ameliorated through strategic seed data augmentation. Our research provides insights into the informatization of low-resource languages and contributes to their dissemination and preservation.


Pages: 13

