Developing children's ASR system under low-resource conditions using end-to-end architecture

Cited by: 1
Authors
Ankita [1]
Shahnawazuddin, S. [1]
Affiliations
[1] Natl Inst Technol Patna, Dept Elect & Commun Engn, Patna, Bihar, India
Keywords
End-to-end speech recognition; Low-resource children's ASR; Out-of-domain data augmentation; In-domain data augmentation; GFCC; FDLP; PROSODY MODIFICATION; LINEAR PREDICTION; SPEECH; RECOGNITION; FEATURES; VARIABILITY;
DOI
10.1016/j.dsp.2024.104385
CLC classification number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject classification codes
0808; 0809;
Abstract
The work presented in this paper aims at enhancing the performance of end-to-end (E2E) speech recognition for children's speech under low-resource conditions. For the majority of languages, there is hardly any speech data from child speakers. Furthermore, even the available children's speech corpora are limited in terms of the number of hours of data. On the other hand, large amounts of adults' speech data are freely available for research as well as commercial purposes. As a consequence, developing an effective E2E automatic speech recognition (ASR) system for children becomes a very challenging task. One may develop an ASR system using adults' speech and then use it to transcribe children's data, but this leads to very poor recognition rates due to the stark differences in the acoustic attributes of adults' and children's speech. In order to overcome these hurdles and to develop a robust children's ASR system employing an E2E architecture, we have resorted to several out-of-domain and in-domain data augmentation techniques. For out-of-domain data augmentation, we explicitly modify adults' speech to render it acoustically similar to children's speech before pooling it into the training set. For in-domain data augmentation, we slightly modify the pitch and duration of children's speech in order to create additional data capturing greater diversity. These data augmentation approaches mitigate, to a certain extent, the ill effects of the scarcity of child-domain data and, in turn, reduce the error rates by a large margin. In addition to data augmentation, we have also studied the efficacy of gammatone frequency cepstral coefficients (GFCC) and the frequency-domain linear prediction (FDLP) technique alongside the most commonly used Mel-frequency cepstral coefficients (MFCC) for front-end speech parameterization. Both MFCC and GFCC capture and model the spectral envelope of speech. In contrast, applying linear prediction to the frequency-domain representation of the speech signal captures the temporal envelope during front-end feature extraction. FDLP features that model the temporal envelope provide important cues for the perception and understanding of stop bursts and, at times, complete phonemes. This motivated us to perform a comparative experimental study of the effectiveness of the three aforementioned front-end acoustic features. In our experiments, the proposed data augmentation in combination with FDLP features yields a relative improvement in character error rate of 67.6% over the baseline system. The combination of data augmentation with MFCC or GFCC features is observed to result in lower recognition performance.
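As an illustration of the augmentation strategy summarized above, the sketch below shows one way explicit pitch and speaking-rate modification (and baseline MFCC extraction) could be implemented. This is a minimal sketch assuming librosa; the function names, the modification factors, and the librosa-based signal processing are illustrative assumptions, not the paper's actual prosody-modification pipeline.

```python
# Hedged sketch of pitch/duration-based data augmentation for children's ASR.
# Assumptions: librosa for signal processing; modification factors are illustrative.
import librosa
import numpy as np

SR = 16000  # assumed sampling rate for all waveforms


def modify_prosody(wave, sr=SR, pitch_steps=0.0, tempo_rate=1.0):
    """Shift pitch by `pitch_steps` semitones and change speaking rate by
    `tempo_rate` (>1 speeds up, <1 slows down)."""
    out = librosa.effects.pitch_shift(wave, sr=sr, n_steps=pitch_steps)
    return librosa.effects.time_stretch(out, rate=tempo_rate)


def out_of_domain_augment(adult_wave, sr=SR):
    """Render adult speech more child-like before pooling it into training:
    raise the pitch and slow the speaking rate (factors are assumptions)."""
    return modify_prosody(adult_wave, sr, pitch_steps=4.0, tempo_rate=0.9)


def in_domain_augment(child_wave, sr=SR, rng=None):
    """Slightly perturb children's speech to create more diverse copies
    (perturbation ranges are assumptions)."""
    if rng is None:
        rng = np.random.default_rng(0)
    return modify_prosody(
        child_wave, sr,
        pitch_steps=rng.uniform(-1.0, 1.0),
        tempo_rate=rng.uniform(0.9, 1.1),
    )


def mfcc_features(wave, sr=SR, n_mfcc=13):
    """Frame-level MFCCs, one of the spectral-envelope front ends compared."""
    return librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)


if __name__ == "__main__":
    # Placeholder audio; in practice this would be an adult or child utterance.
    y, sr = librosa.load(librosa.example("trumpet"), sr=SR)
    feats = mfcc_features(out_of_domain_augment(y, sr), sr)
    print(feats.shape)
```

In a training setup, the augmented waveforms would simply be added to the original pool before feature extraction; GFCC and FDLP front ends would replace the `mfcc_features` step.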
Pages: 9
Related papers (50 in total)
  • [1] Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings
    Wiesner, Matthew
    Renduchintala, Adithya
    Watanabe, Shinji
    Liu, Chunxi
    Dehak, Najim
    Khudanpur, Sanjeev
    INTERSPEECH 2019, 2019, : 4375 - 4379
  • [2] Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets
    Bekarystankyzy, Akbayan
    Mamyrbayev, Orken
    Mendes, Mateus
    Fazylzhanova, Anar
    Assam, Muhammad
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [3] Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling
    Qin, Siqing
    Wang, Longbiao
    Li, Sheng
    Dang, Jianwu
    Pan, Lixin
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2022, 2022 (01)
  • [4] Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling
    Siqing Qin
    Longbiao Wang
    Sheng Li
    Jianwu Dang
    Lixin Pan
    EURASIP Journal on Audio, Speech, and Music Processing, 2022
  • [5] Transfer Learning for End-to-End ASR to Deal with Low-Resource Problem in Persian Language
    Kermanshahi, Maryam Asadolahzade
    Akbari, Ahmad
    Nasersharif, Babak
    2021 26TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC), 2021,
  • [6] Effective Training End-to-End ASR systems for Low-resource Lhasa Dialect of Tibetan Language
    Pan, Lixin
    Li, Sheng
    Wang, Longbiao
    Dang, Jianwu
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1152 - 1156
  • [7] Exploring End-to-End Techniques for Low-Resource Speech Recognition
    Bataev, Vladimir
    Korenevsky, Maxim
    Medennikov, Ivan
    Zatvornitskiy, Alexander
    SPEECH AND COMPUTER (SPECOM 2018), 2018, 11096 : 32 - 41
  • [8] META LEARNING FOR END-TO-END LOW-RESOURCE SPEECH RECOGNITION
    Hsu, Jui-Yang
    Chen, Yuan-Jui
    Lee, Hung-yi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7844 - 7848
  • [9] Data Augmentation Using CycleGAN for End-to-End Children ASR
    Singh, Dipesh K.
    Amin, Preet P.
    Sailor, Hardik B.
    Patil, Hemant A.
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 511 - 515
  • [10] COMBINING END-TO-END AND ADVERSARIAL TRAINING FOR LOW-RESOURCE SPEECH RECOGNITION
    Drexler, Jennifer
    Glass, James
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 361 - 368