Developing children's ASR system under low-resource conditions using end-to-end architecture

Cited by: 1
Authors
Ankita [1]
Shahnawazuddin, S. [1]
Affiliations
[1] Natl Inst Technol Patna, Dept Elect & Commun Engn, Patna, Bihar, India
Keywords
End-to-end speech recognition; Low-resource children's ASR; Out-of-domain data augmentation; In-domain data augmentation; GFCC; FDLP; PROSODY MODIFICATION; LINEAR PREDICTION; SPEECH; RECOGNITION; FEATURES; VARIABILITY;
DOI
10.1016/j.dsp.2024.104385
CLC classification number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject classification codes
0808; 0809;
Abstract
The work presented in this paper aims at enhancing the performance of end-to-end (E2E) speech recognition for children's speech under low-resource conditions. For the majority of languages, there is hardly any speech data from child speakers. Furthermore, even the available children's speech corpora are limited in terms of the number of hours of data. On the other hand, large amounts of adults' speech data are freely available for research as well as commercial purposes. As a consequence, developing an effective E2E automatic speech recognition (ASR) system for children becomes a very challenging task. One may develop an ASR system using adults' speech and then use it to transcribe children's data, but this leads to very poor recognition rates due to the stark differences in the acoustic attributes of adults' and children's speech. In order to overcome these hurdles and to develop a robust children's ASR system employing an E2E architecture, we have resorted to several out-of-domain and in-domain data augmentation techniques. For out-of-domain data augmentation, we explicitly modify adults' speech to render it acoustically similar to children's speech before pooling it into the training set. For in-domain data augmentation, we slightly modify the pitch and duration of children's speech in order to create additional data capturing greater diversity. These data augmentation approaches mitigate, to a certain extent, the ill effects of the scarcity of child-domain data and, in turn, reduce the error rates by a large margin. In addition to data augmentation, we have also studied the efficacy of gammatone frequency cepstral coefficients (GFCC) and the frequency-domain linear prediction (FDLP) technique alongside the most commonly used Mel-frequency cepstral coefficients (MFCC) for front-end speech parameterization. Both MFCC and GFCC capture and model the spectral envelope of speech. In contrast, applying linear prediction to the frequency-domain representation of the speech signal captures the temporal envelope during front-end feature extraction. FDLP features that model the temporal envelope provide important cues for the perception and understanding of stop bursts and, at times, complete phonemes. This motivated us to perform a comparative experimental study of the effectiveness of the three aforementioned front-end acoustic features. In our experiments, the proposed data augmentation in combination with FDLP features yields a relative improvement in character error rate of 67.6% over the baseline system. The combination of data augmentation with MFCC or GFCC features is observed to result in lower recognition performance.
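As an illustration of the augmentation strategy summarized above, the sketch below shows one way explicit pitch and speaking-rate modification (and baseline MFCC extraction) could be implemented. This is a minimal sketch assuming librosa; the function names, the modification factors, and the librosa-based signal processing are illustrative assumptions, not the paper's actual prosody-modification pipeline.

```python
# Hedged sketch of pitch/duration-based data augmentation for children's ASR.
# Assumptions: librosa for signal processing; modification factors are illustrative.
import librosa
import numpy as np

SR = 16000  # assumed sampling rate for all waveforms


def modify_prosody(wave, sr=SR, pitch_steps=0.0, tempo_rate=1.0):
    """Shift pitch by `pitch_steps` semitones and change speaking rate by
    `tempo_rate` (>1 speeds up, <1 slows down)."""
    out = librosa.effects.pitch_shift(wave, sr=sr, n_steps=pitch_steps)
    return librosa.effects.time_stretch(out, rate=tempo_rate)


def out_of_domain_augment(adult_wave, sr=SR):
    """Render adult speech more child-like before pooling it into training:
    raise the pitch and slow the speaking rate (factors are assumptions)."""
    return modify_prosody(adult_wave, sr, pitch_steps=4.0, tempo_rate=0.9)


def in_domain_augment(child_wave, sr=SR, rng=None):
    """Slightly perturb children's speech to create more diverse copies
    (perturbation ranges are assumptions)."""
    if rng is None:
        rng = np.random.default_rng(0)
    return modify_prosody(
        child_wave, sr,
        pitch_steps=rng.uniform(-1.0, 1.0),
        tempo_rate=rng.uniform(0.9, 1.1),
    )


def mfcc_features(wave, sr=SR, n_mfcc=13):
    """Frame-level MFCCs, one of the spectral-envelope front ends compared."""
    return librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)


if __name__ == "__main__":
    # Placeholder audio; in practice this would be an adult or child utterance.
    y, sr = librosa.load(librosa.example("trumpet"), sr=SR)
    feats = mfcc_features(out_of_domain_augment(y, sr), sr)
    print(feats.shape)
```

In a training setup, the augmented waveforms would simply be added to the original pool before feature extraction; GFCC and FDLP front ends would replace the `mfcc_features` step.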
Pages: 9
Related papers (50 in total)
  • [1] Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings
    Wiesner, Matthew
    Renduchintala, Adithya
    Watanabe, Shinji
    Liu, Chunxi
    Dehak, Najim
    Khudanpur, Sanjeev
    INTERSPEECH 2019, 2019, : 4375 - 4379
  • [2] Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets
    Bekarystankyzy, Akbayan
    Mamyrbayev, Orken
    Mendes, Mateus
    Fazylzhanova, Anar
    Assam, Muhammad
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [3] Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling
    Qin, Siqing
    Wang, Longbiao
    Li, Sheng
    Dang, Jianwu
    Pan, Lixin
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2022, 2022 (01)
  • [4] Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling
    Siqing Qin
    Longbiao Wang
    Sheng Li
    Jianwu Dang
    Lixin Pan
    EURASIP Journal on Audio, Speech, and Music Processing, 2022
  • [5] Transfer Learning for End-to-End ASR to Deal with Low-Resource Problem in Persian Language
    Kermanshahi, Maryam Asadolahzade
    Akbari, Ahmad
    Nasersharif, Babak
    2021 26TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC), 2021,
  • [6] Effective Training End-to-End ASR systems for Low-resource Lhasa Dialect of Tibetan Language
    Pan, Lixin
    Li, Sheng
    Wang, Longbiao
    Dang, Jianwu
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1152 - 1156
  • [7] Exploring End-to-End Techniques for Low-Resource Speech Recognition
    Bataev, Vladimir
    Korenevsky, Maxim
    Medennikov, Ivan
    Zatvornitskiy, Alexander
    SPEECH AND COMPUTER (SPECOM 2018), 2018, 11096 : 32 - 41
  • [8] META LEARNING FOR END-TO-END LOW-RESOURCE SPEECH RECOGNITION
    Hsu, Jui-Yang
    Chen, Yuan-Jui
    Lee, Hung-yi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7844 - 7848
  • [9] Data Augmentation Using CycleGAN for End-to-End Children ASR
    Singh, Dipesh K.
    Amin, Preet P.
    Sailor, Hardik B.
    Patil, Hemant A.
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 511 - 515
  • [10] COMBINING END-TO-END AND ADVERSARIAL TRAINING FOR LOW-RESOURCE SPEECH RECOGNITION
    Drexler, Jennifer
    Glass, James
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 361 - 368