Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language

Cited by: 0
Authors
Jiang, Lanlan [1]
Qin, Xingguo [2]
Zhang, Jingwei [2]
Li, Jun [2]
Affiliations
[1] Guilin Univ Elect Technol, Sch Business, Guilin 541004, Peoples R China
[2] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Guilin 541004, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 20
Funding
National Natural Science Foundation of China;
Keywords
seed data augmentation; low-resource data; Latin Cuengh language; multimodal;
DOI
10.3390/app14209533
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Latin Cuengh is a low-resource dialect that is prevalent in select ethnic minority regions in China. This language presents unique challenges for intelligent research and preservation efforts, primarily due to its oral tradition and the limited availability of textual resources. Prior research has sought to improve intelligent processing of Latin Cuengh through data augmentation techniques that leverage scarce textual data, with modest success. In this study, we introduce a multimodal seed data augmentation model designed to enhance the intelligent recognition and comprehension of this dialect. After supplementing the pre-trained model with extensive speech data, we fine-tune it with a modest corpus of multilingual textual seed data, employing both Latin Cuengh and Chinese texts as bilingual seed data to enrich its multilingual properties. We then refine its parameters through a variety of downstream tasks. The proposed model performs well on both multi-class and binary classification tasks, with its average accuracy and F1 measure increasing by more than 3%. Moreover, strategic seed data augmentation substantially improves the model's training efficiency. Our research provides insights into the informatization of low-resource languages and contributes to their dissemination and preservation.
Pages: 13
Related papers
50 in total
  • [11] Data Augmentation for Low-Resource Neural Machine Translation
    Fadaee, Marzieh
    Bisazza, Arianna
    Monz, Christof
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017: 567-573
  • [12] Data Augmentation Methods for Low-Resource Orthographic Syllabification
    Suyanto, Suyanto
    Lhaksmana, Kemas M.
    Bijaksana, Moch Arif
    Kurniawan, Adriana
    IEEE ACCESS, 2020, 8: 147399-147406
  • [13] MIXSPEECH: DATA AUGMENTATION FOR LOW-RESOURCE AUTOMATIC SPEECH RECOGNITION
    Meng, Linghui
    Xu, Jin
    Tan, Xu
    Wang, Jindong
    Qin, Tao
    Xu, Bo
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 7008-7012
  • [14] Data augmentation for low-resource grapheme-to-phoneme mapping
    Hammond, Michael
    SIGMORPHON 2021: 18TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY, 2021: 126-130
  • [15] Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution
    Nguyen, Toan Q.
    Murray, Kenton
    Chiang, David
    IWSLT 2021: THE 18TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION, 2021: 287-293
  • [16] DALE: Generative Data Augmentation for Low-Resource Legal NLP
    Ghosh, Sreyan
    Evuru, Chandra Kiran
    Kumar, Sonal
    Ramaneswaran, S.
    Sakshi, S.
    Tyagi, Utkarsh
    Manocha, Dinesh
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 8511-8565
  • [17] Unsupervised Multimodal Machine Translation for Low-resource Distant Language Pairs
    Tayir, Turghun
    Li, Lin
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (04)
  • [18] Data augmentation for low-resource languages NMT guided by constrained sampling
    Maimaiti, Mieradilijiang
    Liu, Yang
    Luan, Huanbo
    Sun, Maosong
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (01): 30-51
  • [19] A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation
    Li, Yu
    Li, Xiao
    Yang, Yating
    Dong, Rui
    INFORMATION, 2020, 11 (05)
  • [20] Optimizing the impact of data augmentation for low-resource grammatical error correction
    Solyman, Aiman
    Zappatore, Marco
    Zhenyu, Wang
    Mahmoud, Zeinab
    Alfatemi, Ali
    Ibrahim, Ashraf Osman
    Gabralla, Lubna Abdelkareim
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (06)