Transfer learning through perturbation-based in-domain spectrogram augmentation for adult speech recognition

Cited by: 2
Authors
Kadyan, Virender [1 ]
Bawa, Puneet [2 ]
Affiliations
[1] Univ Petr & Energy Studies UPES, Speech & Language Res Ctr, Sch Comp Sci, Dehra Dun 248007, Uttarakhand, India
[2] Chitkara Univ, Inst Engn & Technol, Ctr Excellence Speech & Multimodal Lab, Rajpura, Punjab, India
Source
NEURAL COMPUTING & APPLICATIONS | 2022, Vol. 34, Issue 23
Keywords
Deep neural network; Punjabi speech recognition; Data augmentation; Spectrogram augmentation; Transfer learning;
DOI
10.1007/s00521-022-07579-6
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The development of numerous frameworks and pedagogical practices has significantly improved the performance of deep learning-based speech recognition systems in recent years. Developing automatic speech recognition (ASR) for indigenous languages nevertheless remains enormously complex owing to their wide range of auditory and linguistic components and to the scarcity of speech and text data, both of which significantly degrade ASR performance. The main purpose of this research is to apply in-domain data augmentation methods effectively to resolve the challenge of data scarcity, resulting in improved neural network consistency. The research further details how to create synthetic datasets via pooled augmentation methodologies in conjunction with transfer learning techniques, primarily spectrogram augmentation. Initially, the richness of the signal is improved through deformation of the time and/or frequency axis: time-warping deforms the signal's envelope, whereas frequency-warping alters its spectral content. Second, the raw signal is processed with audio-level speech perturbation methods such as speed perturbation and vocal tract length perturbation. These methods are shown to be effective in addressing data scarcity while being simple to implement at low cost. Nevertheless, because multiple modified versions of a single input are fed into the network during training, they enlarge the dataset with correlated copies of the same utterances, which is likely to result in overfitting. Consequently, an effort has been made to mitigate this overfitting by integrating two-level augmentation procedures, pooling prosody/spectrogram-modified and original speech signals via transfer learning techniques.
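The paper's own pipeline is not reproduced here; as a minimal NumPy sketch of the two method families the abstract names (spectrogram-level deformation and audio-level perturbation), the following applies SpecAugment-style frequency masking to a spectrogram and speed perturbation to a raw waveform. Function names and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def freq_mask(spec, max_width=8, rng=None):
    """SpecAugment-style frequency masking: zero out a random band of
    frequency bins in a (freq x time) spectrogram."""
    rng = rng or np.random.default_rng()
    n_freq = spec.shape[0]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_freq - width)))
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out

def speed_perturb(signal, factor):
    """Audio-level speed perturbation via linear-interpolation resampling.
    factor > 1 shortens (speeds up) the waveform; factor < 1 lengthens it."""
    n_out = int(round(len(signal) / factor))
    old_idx = np.arange(len(signal), dtype=float)
    new_idx = np.linspace(0.0, len(signal) - 1.0, n_out)
    return np.interp(new_idx, old_idx, signal)
```

In practice, speed perturbation is commonly applied with factors such as 0.9, 1.0, and 1.1, and the linear interpolation above is a simplification of proper band-limited resampling.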
Finally, the adult ASR system was evaluated with a deep neural network (DNN) using concatenated feature analysis employing Mel-frequency cepstral coefficients (MFCC), pitch features, and the normalization technique of Vocal Tract Length Normalization (VTLN) on pooled Punjabi datasets, yielding a relative improvement of 41.16 percent over the baseline system.
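The concatenated front end described above can be sketched as follows. This is not the authors' code: the piecewise-linear warp (knee frequency, fixed Nyquist endpoint) is an assumed simplification of VTLN, and the helper names are invented for illustration.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyq=8000.0, knee=6000.0):
    """Piecewise-linear VTLN frequency warp.

    Below `knee`, frequencies are scaled by 1/alpha; above it, a linear
    segment connects (knee, knee/alpha) to (f_nyq, f_nyq) so the Nyquist
    frequency stays fixed."""
    f = np.asarray(freqs, dtype=float)
    slope = (f_nyq - knee / alpha) / (f_nyq - knee)
    return np.where(f < knee, f / alpha, knee / alpha + slope * (f - knee))

def concat_mfcc_pitch(mfcc, pitch):
    """Append a per-frame pitch value to each MFCC frame (frames x dims)."""
    assert mfcc.shape[0] == pitch.shape[0], "frame counts must match"
    return np.hstack([mfcc, pitch.reshape(-1, 1)])
```

The warp would typically be applied to the mel filterbank center frequencies before MFCC computation, with alpha estimated per speaker.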
Pages: 21015-21033
Page count: 19
Related Papers
50 records
  • [41] LEARNING NOISE INVARIANT FEATURES THROUGH TRANSFER LEARNING FOR ROBUST END-TO-END SPEECH RECOGNITION
    Zhang, Shucong
    Do, Cong-Thanh
    Doddipatla, Rama
    Renals, Steve
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7024 - 7028
  • [42] Thermal error model of machine tool spindle based on in-domain alignment and transfer learning under variable working conditions
    Zheng Y.
    Fu G.
    Lei G.
    Zhou L.
    Zhu S.
    Yi Qi Yi Biao Xue Bao/Chinese Journal of Scientific Instrument, 2023, 44 (05): : 33 - 43
  • [43] Sparse Autoencoder-based Feature Transfer Learning for Speech Emotion Recognition
    Deng, Jun
    Zhang, Zixing
    Marchi, Erik
    Schuller, Bjoern
    2013 HUMAINE ASSOCIATION CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2013, : 511 - 516
  • [44] Research on automatic speech recognition based on a DL-T and transfer learning
    Zhang W.
    Liu C.
    Fei H.-B.
    Li W.
    Yu J.-H.
    Cao Y.
    Gongcheng Kexue Xuebao/Chinese Journal of Engineering, 2021, 43 (03): : 433 - 441
  • [45] Helicopter cockpit speech recognition method based on transfer learning and context biasing
    Wang, Guotao
    Wang, Jiaqi
    Wang, Shicheng
    Wu, Qianyu
    Teng, Yuru
    ENGINEERING RESEARCH EXPRESS, 2024, 6 (03):
  • [46] SENet-based speech emotion recognition using synthesis-style transfer data augmentation
    Rajan R.
    Hridya Raj T.V.
    International Journal of Speech Technology, 2023, 26 (04) : 1017 - 1030
  • [47] Unmanned Aerial Vehicle Control through Domain-Based Automatic Speech Recognition
    Contreras, Ruben
    Ayala, Angel
    Cruz, Francisco
    COMPUTERS, 2020, 9 (03) : 1 - 15
  • [48] High-order similarity learning based domain adaptation for speech emotion recognition
    Wang, Hao
    Ji, Yixuan
    Song, Peng
    Liu, Zhaowei
    APPLIED ACOUSTICS, 2025, 231
  • [49] Language dialect based speech emotion recognition through deep learning techniques
    Sukumar Rajendran
    Sandeep Kumar Mathivanan
    Prabhu Jayagopal
    Maheshwari Venkatasen
    Thanapal Pandi
    Manivannan Sorakaya Somanathan
    Muthamilselvan Thangaval
    Prasanna Mani
    International Journal of Speech Technology, 2021, 24 : 625 - 635
  • [50] Interventions in STEM Education Through Speech Recognition-Based Learning Analysis
    Lin, Chia-Ju
    Wang, Wei-Sheng
    Lee, Hsin-Yu
    Huang, Yueh-Min
    Wu, Ting-Ting
    JOURNAL OF EDUCATIONAL COMPUTING RESEARCH, 2025, 63 (02) : 311 - 335