Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language

Times cited: 0
Authors
Jiang, Lanlan [1]
Qin, Xingguo [2]
Zhang, Jingwei [2]
Li, Jun [2]
Affiliations
[1] Guilin Univ Elect Technol, Sch Business, Guilin 541004, Peoples R China
[2] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Guilin 541004, Peoples R China
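Source
APPLIED SCIENCES-BASEL | 2024 / Volume 14 / Issue 20
Funding
National Natural Science Foundation of China
Keywords
seed data augmentation; low-resource data; Latin Cuengh language; multimodal
DOI
10.3390/app14209533
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract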
Latin Cuengh is a low-resource dialect that is prevalent in select ethnic minority regions in China. This language presents unique challenges for intelligent research and preservation efforts, primarily due to its oral tradition and the limited availability of textual resources. Prior research has sought to bolster intelligent processing capabilities with regard to Latin Cuengh through data augmentation techniques leveraging scarce textual data, with modest success. In this study, we introduce an innovative multimodal seed data augmentation model designed to significantly enhance the intelligent recognition and comprehension of this dialect. After supplementing the pre-trained model with extensive speech data, we fine-tune its performance with a modest corpus of multilingual textual seed data, employing both Latin Cuengh and Chinese texts as bilingual seed data to enrich its multilingual properties. We then refine its parameters through a variety of downstream tasks. The proposed model achieves a commendable performance across both multi-classification and binary classification tasks, with its average accuracy and F1 measure increasing by more than 3%. Moreover, the model's training efficiency is substantially ameliorated through strategic seed data augmentation. Our research provides insights into the informatization of low-resource languages and contributes to their dissemination and preservation.
Pages: 13
Related papers
50 records in total
  • [21] DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks
    Ding, Bosheng
    Liu, Linlin
    Bing, Lidong
    Kruengkrai, Canasai
    Nguyen, Thien Hai
    Joty, Shafiq
    Si, Luo
    Miao, Chunyan
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6045 - 6057
  • [22] Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime
    Chen, Junfan
    Zhang, Richong
    Luo, Zheyan
    Hu, Chunming
    Mao, Yongyi
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 12626 - 12634
  • [23] Examining Sentiment Analysis for Low-Resource Languages with Data Augmentation Techniques
    Thakkar, Gaurish
    Preradovic, Nives Mikelic
    Tadic, Marko
    ENG, 2024, 5 (04): : 2920 - 2942
  • [24] LOW-RESOURCE EXPRESSIVE TEXT-TO-SPEECH USING DATA AUGMENTATION
    Huybrechts, Goeric
    Merritt, Thomas
    Comini, Giulia
    Perz, Bartek
    Shah, Raahil
    Lorenzo-Trueba, Jaime
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6593 - 6597
  • [25] Low-Resource Comparative Opinion Quintuple Extraction by Data Augmentation with Prompting
    Xu, Qingting
    Hong, Yu
    Zhao, Fubang
    Song, Kaisong
    Kang, Yangyang
    Chen, Jiaxiang
    Zhou, Guodong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 3892 - 3897
  • [26] Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
    Sahin, Goezde Guel
    Steedman, Mark
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 5004 - 5009
  • [27] Adding Visual Information to Improve Multimodal Machine Translation for Low-Resource Language
    Shi, Xiayang
    Yu, Zhenqiang
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
  • [28] Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages
    Ziyaden, Atabay
    Yelenov, Amir
    Hajiyev, Fuad
    Rustamov, Samir
    Pak, Alexandr
    PEERJ COMPUTER SCIENCE, 2024, 10
  • [29] Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks
    Nag, Arijit
    Samanta, Bidisha
    Mukherjee, Animesh
    Ganguly, Niloy
    Chakrabarti, Soumen
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 8619 - 8629
  • [30] BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER
    Ghosh, Sreyan
    Tyagi, Utkarsh
    Kumar, Sonal
    Manocha, Dinesh
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 1853 - 1858