Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Cited by: 0

Authors:

Min, Dong-Jin [1]

Kim, Deok-Hwan [1]

Affiliations:

[1] Inha Univ, Dept Elect & Comp Engn, Incheon 22212, South Korea

Source:

IEEE ACCESS | 2024 / Vol. 12

Funding:

National Research Foundation of Singapore;

Keywords:

Emotion recognition; Speech recognition; Hidden Markov models; Feature extraction; Brain modeling; Accuracy; Convolutional neural networks; Data models; Time-domain analysis; Deep learning; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning; FEATURES; DATABASES; ATTENTION; NETWORK;

DOI:

10.1109/ACCESS.2024.3506565

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships. In the realm of natural human-computer interaction, speech-emotion recognition plays a crucial role. This process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech-emotion recognition datasets is a complex and costly process, leading to limited data volumes and uneven emotional distributions. This scarcity and imbalance pose significant challenges, affecting the accuracy and reliability of emotion recognition. To address these issues, this study introduces a novel model that is more robust and adaptive. We employ the Ranking Magnitude Method (RMM) based on sparse learning. We use the Root Mean Square (RMS) energy and Zero Crossing Rate (ZCR) as temporal features to measure the speech's overall volume and noise intensity. The Mel Frequency Cepstral Coefficient (MFCC) is utilized to extract critical speech features, which are then integrated into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN) model. We analyze the utterance levels using the log-Mel spectrogram for spatial features, processing these patterns through a 2D Convolutional Neural Network Squeeze and Excitation Network (CNN-SEN) model. The core of our method is a Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalances by selectively retraining the underperforming nodes. This dynamic adjustment of learning priorities significantly enhances the robustness and accuracy of emotion recognition. Using this approach, our model outperforms state-of-the-art methods for various datasets, achieving impressive accuracy rates of 97.18%, 97.92%, 99.31%, and 96.89% for the EMOVO, RAVDESS, SAVEE, and EMO-DB datasets, respectively.
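
As an illustration of the two temporal features named in the abstract, the following is a minimal NumPy sketch of frame-wise RMS energy and zero crossing rate. It is not the authors' implementation: the frame length (1024), hop length (512), the synthetic 440 Hz test tone, and the helper names frame_signal, rms_energy, and zero_crossing_rate are all assumptions chosen for the example.

```python
import numpy as np

def frame_signal(x, frame_length=1024, hop_length=512):
    """Split a 1-D signal into overlapping frames (no padding; assumed sizes)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    # Row i holds samples [i * hop_length, i * hop_length + frame_length).
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return x[idx]

def rms_energy(frames):
    """Root mean square of each frame, a proxy for loudness."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    signs = np.signbit(frames)
    crossings = signs[:, 1:] != signs[:, :-1]
    return crossings.mean(axis=1)

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

frames = frame_signal(tone)
rms = rms_energy(frames)
zcr = zero_crossing_rate(frames)
```

For a pure tone the RMS stays near the amplitude divided by the square root of two, and the zero crossing rate stays near twice the tone frequency divided by the sampling rate. Noisier signals generally cross zero more often, so their rate is higher.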

Pages: 177219-177235

Page count: 17

