Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Cited: 0
Authors
Min, Dong-Jin [1 ]
Kim, Deok-Hwan [1 ]
Affiliations
[1] Inha Univ, Dept Elect & Comp Engn, Incheon 22212, South Korea
Source
IEEE ACCESS | 2024 / Volume 12
Funding
National Research Foundation, Singapore;
Keywords
Emotion recognition; Speech recognition; Hidden Markov models; Feature extraction; Brain modeling; Accuracy; Convolutional neural networks; Data models; Time-domain analysis; Deep learning; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning; FEATURES; DATABASES; ATTENTION; NETWORK;
DOI
10.1109/ACCESS.2024.3506565
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships. In natural human-computer interaction, speech emotion recognition plays a crucial role. The process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech-emotion datasets is complex and costly, leading to limited data volumes and uneven emotional distributions; this scarcity and imbalance degrade the accuracy and reliability of emotion recognition. To address these issues, this study introduces a model that is more robust to data scarcity and class imbalance, employing the Ranking Magnitude Method (RMM) based on sparse learning. We use Root Mean Square (RMS) energy and Zero Crossing Rate (ZCR) as temporal features to measure the speech's overall volume and noise intensity, and Mel Frequency Cepstral Coefficients (MFCC) to extract critical speech features; these are integrated into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN) model. For spatial features, we analyze utterance-level patterns with the log-Mel spectrogram, processed by a 2D Convolutional Neural Network Squeeze and Excitation Network (CNN-SEN) model. The core of our method is a Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalance by selectively retraining underperforming nodes. This dynamic adjustment of learning priorities significantly enhances the robustness and accuracy of emotion recognition. With this approach, our model outperforms state-of-the-art methods across multiple datasets, achieving accuracy rates of 97.18%, 97.92%, 99.31%, and 96.89% on the EMOVO, RAVDESS, SAVEE, and EMO-DB datasets, respectively.
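As a minimal sketch of the two feature pipelines the abstract names, the snippet below computes per-frame RMS energy, ZCR, and MFCCs for the temporal (LSTM-FCN) branch and a log-Mel spectrogram for the spatial (CNN-SEN) branch. The use of librosa, the synthetic test tone, and the frame/hop sizes are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the temporal features (RMS, ZCR, MFCC) and the log-Mel
# spectrogram described in the abstract. Sample rate, hop size, and the
# synthetic signal are assumed for illustration.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220 * t)  # 1 s synthetic tone standing in for a speech utterance

hop = 512
rms = librosa.feature.rms(y=y, hop_length=hop)                      # (1, T): per-frame loudness
zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)         # (1, T): per-frame noisiness
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)  # (13, T): spectral envelope

# Temporal branch input: a multivariate time series of shape (T, 15)
# for an LSTM-FCN-style model.
temporal = np.concatenate([rms, zcr, mfcc], axis=0).T

# Spatial branch input: the log-Mel spectrogram, treated as a 2-D map
# of shape (n_mels, T) for a 2D CNN.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=hop)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(temporal.shape, log_mel.shape)  # e.g. (32, 15) (128, 32)
```

Stacking RMS, ZCR, and MFCCs into one multivariate series mirrors the abstract's description of the temporal branch; the exact coefficient counts and windowing would follow the paper.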
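The squeeze-and-excitation module underlying the CNN-SEN branch follows the standard SE design (Hu et al., 2018), sketched below in PyTorch. The channel count, reduction ratio, and input shape are assumptions for illustration, not the paper's configuration.

```python
# Standard squeeze-and-excitation block of the kind a CNN-SEN spatial branch
# would apply to its log-Mel feature maps. Channel count, reduction ratio,
# and input shape are illustrative assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average per channel
        self.fc = nn.Sequential(             # excitation: bottleneck gating network
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # reweight each channel before the next conv stage

x = torch.randn(2, 64, 128, 32)  # (batch, channels, n_mels, frames), shapes assumed
print(SEBlock(64)(x).shape)      # torch.Size([2, 64, 128, 32])
```

Gating channels this way lets the network emphasize Mel bands that carry emotional cues; how the paper fuses this branch with the LSTM-FCN branch through the sparse-learning fusion model (SLBF/RMM) is specific to the paper and not reproduced here.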
Pages: 177219-177235
Page count: 17
Related Papers
50 records in total
  • [41] Efficient bimodal emotion recognition system based on speech/text embeddings and ensemble learning fusion
    Chakhtouna, Adil
    Sekkate, Sara
    Adib, Abdellah
    ANNALS OF TELECOMMUNICATIONS, 2025,
  • [42] Speech based Emotion Recognition using Machine Learning
    Deshmukh, Girija
    Gaonkar, Apurva
    Golwalkar, Gauri
    Kulkarni, Sukanya
    PROCEEDINGS OF THE 2019 3RD INTERNATIONAL CONFERENCE ON COMPUTING METHODOLOGIES AND COMMUNICATION (ICCMC 2019), 2019, : 812 - 817
  • [43] Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition
    Jalal, Md Asif
    Milner, Rosanna
    Hain, Thomas
    INTERSPEECH 2020, 2020, : 4113 - 4117
  • [44] Deep Learning-Based Amplitude Fusion for Speech Dereverberation
    Liu, Chunlei
    Wang, Longbiao
    Dang, Jianwu
    DISCRETE DYNAMICS IN NATURE AND SOCIETY, 2020, 2020
  • [45] On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition
    Mukhamediya, Azamat
    Fazli, Siamac
    Zollanvari, Amin
    IEEE ACCESS, 2023, 11 : 61950 - 61957
  • [46] LEARNING-BASED AUDITORY ENCODING FOR ROBUST SPEECH RECOGNITION
    Chiu, Yu-Hsiang Bosco
    Raj, Bhiksha
    Stern, Richard M.
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4278 - 4281
  • [47] Deep Learning-based Telephony Speech Recognition in the Wild
    Han, Kyu J.
    Hahm, Seongjun
    Kim, Byung-Hak
    Kim, Jungsuk
    Lane, Ian
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1323 - 1327
  • [48] A Novel Speech Emotion Recognition Method via Incomplete Sparse Least Square Regression
    Zheng, Wenming
    Xin, Minghai
    Wang, Xiaolan
    Wang, Bei
    IEEE SIGNAL PROCESSING LETTERS, 2014, 21 (05) : 569 - 572
  • [49] Learning-Based Auditory Encoding for Robust Speech Recognition
    Chiu, Yu-Hsiang Bosco
    Raj, Bhiksha
    Stern, Richard M.
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (03): : 900 - 914
  • [50] Speech Emotion Recognition Based on Acoustic Segment Model
    Zheng, Siyuan
    Du, Jun
    Zhou, Hengshun
    Bai, Xue
    Lee, Chin-Hui
    Li, Shipeng
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,