Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Cited by: 0
|
Authors
Min, Dong-Jin [1 ]
Kim, Deok-Hwan [1 ]
Affiliations
[1] Inha Univ, Dept Elect & Comp Engn, Incheon 22212, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore;
Keywords
Emotion recognition; Speech recognition; Hidden Markov models; Feature extraction; Brain modeling; Accuracy; Convolutional neural networks; Data models; Time-domain analysis; Deep learning; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning; FEATURES; DATABASES; ATTENTION; NETWORK;
DOI
10.1109/ACCESS.2024.3506565
Chinese Library Classification (CLC) code
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships. Speech emotion recognition therefore plays a crucial role in natural human-computer interaction. The process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech-emotion datasets is complex and costly, which leads to limited data volumes and uneven emotional distributions. This scarcity and imbalance degrade the accuracy and reliability of emotion recognition. To address these issues, this study introduces a more robust and adaptive model. We employ the Ranking Magnitude Method (RMM) based on sparse learning. We use Root Mean Square (RMS) energy and the Zero Crossing Rate (ZCR) as temporal features to measure the speech's overall volume and noise intensity. Mel Frequency Cepstral Coefficients (MFCCs) are used to extract critical speech features, which are then fed into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN). For spatial features, we analyze utterances with the log-Mel spectrogram and process these patterns through a 2D Convolutional Neural Network Squeeze and Excitation Network (CNN-SEN). The core of our method is a Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalance by selectively retraining underperforming nodes. This dynamic adjustment of learning priorities significantly improves the robustness and accuracy of emotion recognition. With this approach, our model outperforms state-of-the-art methods across multiple datasets, achieving accuracies of 97.18%, 97.92%, 99.31%, and 96.89% on the EMOVO, RAVDESS, SAVEE, and EMO-DB datasets, respectively.
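The temporal features named in the abstract (RMS energy and ZCR) can be computed directly from framed audio. The sketch below is not the authors' implementation; it is a minimal NumPy illustration under assumed parameters (16 kHz audio, 400-sample frames, 160-sample hop), and a plain log-power spectrogram stands in for the paper's log-Mel spectrogram, which additionally requires a Mel filterbank.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    """Per-frame RMS energy: a proxy for overall volume."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame: a proxy for noise intensity."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def log_power_spectrogram(frames, n_fft=512):
    """Windowed log-power spectrogram (stand-in for the log-Mel spectrogram)."""
    windowed = frames * np.hanning(frames.shape[1])
    spec = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2
    return np.log(spec + 1e-10)

# Usage on a synthetic 1-second, 220 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 220 * t)

frames = frame_signal(x)
print(rms_energy(frames).shape, zero_crossing_rate(frames).shape)
```

A real pipeline would stack these per-frame features into a multivariate sequence for the LSTM-FCN branch, while the (Mel-warped) spectrogram image feeds the 2D CNN-SEN branch.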
Pages: 177219-177235
Page count: 17
Related Papers
50 records
  • [31] A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
    Tu, Zhongwen
    Liu, Bin
    Zhao, Wei
    Yan, Raoxin
    Zou, Yang
    APPLIED SCIENCES-BASEL, 2023, 13 (07):
  • [32] A Novel Speech Emotion Recognition Method via Transfer PCA and Sparse Coding
    Song, Peng
    Zheng, Wenming
    Liu, Jingjing
    Li, Jing
    Zhang, Xinran
    BIOMETRIC RECOGNITION, CCBR 2015, 2015, 9428 : 393 - 400
  • [33] Application of Transfer Learning-Based English Speech Emotion Recognition in Real-World Scenarios
    Zhang, Ping
    PROCEEDINGS OF 2024 INTERNATIONAL CONFERENCE ON MACHINE INTELLIGENCE AND DIGITAL APPLICATIONS, MIDA2024, 2024, : 224 - 229
  • [34] Contrastive Learning-Based Multimodal Fusion Model for Automatic Modulation Recognition
    Liu, Fugang
    Pan, Jingyi
    Zhou, Ruolin
    IEEE COMMUNICATIONS LETTERS, 2024, 28 (01) : 78 - 82
  • [35] Spontaneous speech emotion recognition via multiple kernel learning
    Zha, Cheng
    Yang, Ping
    Zhang, Xinran
    Zhao, Li
    PROCEEDINGS 2016 EIGHTH INTERNATIONAL CONFERENCE ON MEASURING TECHNOLOGY AND MECHATRONICS AUTOMATION ICMTMA 2016, 2016, : 621 - 623
  • [36] Transfer Learning for Personality Perception via Speech Emotion Recognition
    Li, Yuanchao
    Bell, Peter
    Lai, Catherine
    INTERSPEECH 2023, 2023, : 5197 - 5201
  • [37] Speech Emotion Recognition Based on Multi Acoustic Feature Fusion
    Xiang, Shanshan
    Anwer, Sadiyagul
    Yilahun, Hankiz
    Hamdulla, Askar
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312 : 338 - 346
  • [38] Speech emotion recognition based on multimodal and multiscale feature fusion
    Hu, Huangshui
    Wei, Jie
    Sun, Hongyu
    Wang, Chuhang
    Tao, Shuo
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [39] Speech emotion recognition using kernel sparse representation based classifier
    Sharma, Pulkit
    Abrol, Vinayak
    Sachdev, Abhijeet
    Dileep, A. D.
    Sao, Anil Kumar
    2016 24TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2016, : 374 - 377
  • [40] Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning
    Liu, Dong
    Wang, Zhiyong
    Wang, Lifeng
    Chen, Longxi
    FRONTIERS IN NEUROROBOTICS, 2021, 15