Transfer learning through perturbation-based in-domain spectrogram augmentation for adult speech recognition

Cited by: 2
Authors
Kadyan, Virender [1 ]
Bawa, Puneet [2 ]
Affiliations
[1] Univ Petr & Energy Studies UPES, Speech & Language Res Ctr, Sch Comp Sci, Dehra Dun 248007, Uttarakhand, India
[2] Chitkara Univ, Inst Engn & Technol, Ctr Excellence Speech & Multimodal Lab, Rajpura, Punjab, India
Source
NEURAL COMPUTING & APPLICATIONS | 2022, Vol. 34, No. 23
Keywords
Deep neural network; Punjabi speech recognition; Data augmentation; Spectrogram augmentation; Transfer learning;
DOI
10.1007/s00521-022-07579-6
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in deep learning frameworks and training methodologies have substantially improved the performance of speech recognition systems. Developing automatic speech recognition (ASR) for indigenous languages, however, remains difficult: the wide range of acoustic and linguistic variation, combined with scarce speech and text data, significantly degrades system performance. The main purpose of this research is to apply in-domain data augmentation methods to mitigate data scarcity and thereby improve the consistency of the neural network. The work further details how to create synthetic datasets by pooling augmentation methodologies with transfer learning techniques, primarily spectrogram augmentation. First, the richness of the signal is increased by deforming the time and/or frequency axis: time-warping deforms the signal's temporal envelope, whereas frequency-warping alters its spectral content. Second, the raw signal is transformed with audio-level perturbation methods such as speed perturbation and vocal tract length perturbation. These methods are shown to be effective against data scarcity and are cheap and simple to implement. However, because multiple versions of a single input are fed into the network during training, they effectively enlarge the dataset in a way that can lead to overfitting. Consequently, the problem of overfitting is addressed by integrating two-level augmentation procedures, pooling prosody- and spectrogram-modified signals with the original speech using transfer learning techniques. Finally, an adult ASR system was built with a deep neural network (DNN) on the pooled Punjabi datasets, using concatenated Mel-frequency cepstral coefficient (MFCC) and pitch features together with vocal tract length normalization (VTLN), yielding a relative improvement of 41.16% over the baseline system.
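The pipeline outlined in the abstract (audio-level speed perturbation plus warping of the spectrogram's time and frequency axes, with the perturbed copies pooled alongside the unmodified signals) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the constant-factor warp, the perturbation factors (0.9/1.1 for speed, 1.05/0.97 for the axes), and the toy 80 x 100 feature matrix are illustrative assumptions; the paper itself builds MFCC + pitch features with VTLN and trains a DNN on the pooled Punjabi data.

import numpy as np

def speed_perturb(signal: np.ndarray, factor: float) -> np.ndarray:
    # Kaldi-style speed perturbation sketch: resample the waveform on a
    # stretched time grid. factor > 1.0 speeds the utterance up, < 1.0 slows
    # it down; both duration and pitch change, which is the intended effect.
    n_out = int(round(len(signal) / factor))
    src_positions = np.arange(n_out) * factor          # where each output sample reads from
    return np.interp(src_positions, np.arange(len(signal)), signal)

def warp_axis(spec: np.ndarray, axis: int, warp_factor: float) -> np.ndarray:
    # Simplified constant-factor deformation of one axis of a (freq x time)
    # spectrogram: axis=1 deforms the temporal envelope (time-warping),
    # axis=0 shifts spectral content (frequency-warping).
    spec = np.moveaxis(spec, axis, 0)
    n = spec.shape[0]
    warped_pos = np.clip(np.arange(n) * warp_factor, 0, n - 1)
    out = np.empty_like(spec)
    for j in range(spec.shape[1]):
        out[:, j] = np.interp(warped_pos, np.arange(n), spec[:, j])
    return np.moveaxis(out, 0, axis)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wave = rng.standard_normal(16000)                  # stand-in for a 1 s utterance at 16 kHz
    # audio-level perturbation: pool the original with 0.9x and 1.1x speed copies
    pooled_waves = [wave] + [speed_perturb(wave, f) for f in (0.9, 1.1)]
    # spectrogram-level perturbation on a toy 80-bin x 100-frame feature matrix
    spec = rng.standard_normal((80, 100))
    pooled_specs = [spec,
                    warp_axis(spec, axis=1, warp_factor=1.05),   # time-warp
                    warp_axis(spec, axis=0, warp_factor=0.97)]   # frequency-warp
    print(len(pooled_waves), len(pooled_specs))

In this sketch each utterance contributes several training examples once the perturbed copies are pooled with the original corpus, which is the dataset-enlargement effect the abstract describes.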
Pages: 21015-21033
Number of pages: 19