A Combined CNN Architecture for Speech Emotion Recognition

Cited by: 1
Authors
Begazo, Rolinson [1]
Aguilera, Ana [2, 3]
Dongo, Irvin [1, 4]
Cardinale, Yudith [5]
Affiliations
[1] Univ Catolica San Pablo, Elect & Elect Engn Dept, Arequipa 04001, Peru
[2] Univ Valparaiso, Fac Ingn, Escuela Ingn Informat, Valparaiso 2340000, Chile
[3] Univ Valparaiso, Interdisciplinary Ctr Biomed Res & Hlth Engn MEDIN, Valparaiso 2340000, Chile
[4] Univ Bordeaux, ESTIA Inst Technol, F-64210 Bidart, France
[5] Univ Int Valencia, Grp Invest Ciencia Datos, Valencia 46002, Spain
Keywords
speech emotion recognition; deep learning; spectral features; spectrogram imaging; feature fusion; convolutional neural network; NEURAL-NETWORKS; FEATURES; CORPUS
DOI
10.3390/s24175797
Chinese Library Classification
O65 [Analytical Chemistry]
Subject classification codes
070302; 081704
Abstract
Emotion recognition through speech is a technique employed in various scenarios of Human-Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, with the quantity and diversity of data being more notable when deep learning techniques are used. The lack of a standard in feature selection leads to continuous development and experimentation. Choosing and designing the appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques, proposing a comprehensive approach, and developing preprocessing and feature selection stages while constructing a dataset called EmoDSc as a result of combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images, the weighted accuracy reached 90%. These results, although surpassing previous research, highlight the strengths and limitations when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, demonstrates a remarkable accuracy of 96%.
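The two-branch idea described in the abstract can be illustrated with a small, untrained forward pass. This is a minimal sketch and not the authors' implementation: it assumes that fusion means concatenating the pooled outputs of a 1-D convolutional branch (spectral feature vector) and a 2-D convolutional branch (spectrogram image) before an MLP. All names (`fused_forward`, `init_params`), the layer sizes, the kernel sizes, the input shapes, and the number of emotion classes are hypothetical choices made only for illustration.

```python
import math
import random


def conv1d(x, kernel):
    """Valid 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]


def conv2d(img, kernel):
    """Valid 2-D cross-correlation followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    return [[max(0.0, sum(img[r + a][c + b] * kernel[a][b]
                          for a in range(kh) for b in range(kw)))
             for c in range(len(img[0]) - kw + 1)]
            for r in range(len(img) - kh + 1)]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_params(n_classes, n_filters=2, hidden=4, seed=0):
    """Random, untrained weights; every size here is a hypothetical choice."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(-0.5, 0.5)

    return {
        "k1d": [[u() for _ in range(3)] for _ in range(n_filters)],
        "k2d": [[[u() for _ in range(3)] for _ in range(3)]
                for _ in range(n_filters)],
        "w_h": [[u() for _ in range(2 * n_filters)] for _ in range(hidden)],
        "b_h": [0.0] * hidden,
        "w_o": [[u() for _ in range(hidden)] for _ in range(n_classes)],
        "b_o": [0.0] * n_classes,
    }


def fused_forward(spectral, spectrogram, params):
    """Two branches, concatenation (assumed fusion step), then an MLP head."""
    # Branch 1: 1-D filters over the spectral vector, global max pool per filter.
    f1 = [max(conv1d(spectral, k)) for k in params["k1d"]]
    # Branch 2: 2-D filters over the spectrogram, global max pool per filter.
    f2 = [max(max(row) for row in conv2d(spectrogram, k))
          for k in params["k2d"]]
    fused = f1 + f2  # feature-level fusion by concatenation (assumption)
    hidden = [max(0.0, h) for h in dense(fused, params["w_h"], params["b_h"])]
    return softmax(dense(hidden, params["w_o"], params["b_o"]))


# Toy inputs: a 12-value spectral vector and an 8x8 "spectrogram".
rng = random.Random(1)
spectral = [rng.random() for _ in range(12)]
spectrogram = [[rng.random() for _ in range(8)] for _ in range(8)]
probs = fused_forward(spectral, spectrogram, init_params(n_classes=4))
print(probs)  # one probability per hypothetical emotion class
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how two differently shaped inputs can end up in one shared classifier.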
Pages: 39
Related papers
50 in total
  • [41] Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions
    Nam, Youngja
    Lee, Chankyu
    SENSORS, 2021, 21 (13)
  • [42] EmotionNAS: Two-stream Neural Architecture Search for Speech Emotion Recognition
    Sun, Haiyang
    Lian, Zheng
    Liu, Bin
    Li, Ying
    Sun, Licai
    Cai, Cong
    Tao, Jianhua
    Wang, Meng
    Cheng, Yuan
    INTERSPEECH 2023, 2023, : 3597 - 3601
  • [43] Empirical Analysis of Shallow and Deep Architecture Classifiers on Emotion Recognition from Speech
    Singh, Vaibhav
    Sharma, Kapil
    2019 6TH IEEE INTERNATIONAL CONFERENCE ON CYBER SECURITY AND CLOUD COMPUTING (IEEE CSCLOUD 2019) / 2019 5TH IEEE INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND SCALABLE CLOUD (IEEE EDGECOM 2019), 2019, : 69 - 73
  • [44] Memristor-Based Progressive Hierarchical Conformer Architecture for Speech Emotion Recognition
    Zhao, Tianhao
    Zhou, Yue
    Hu, Xiaofang
    INTERNATIONAL JOURNAL OF BIFURCATION AND CHAOS, 2024, 34 (09):
  • [45] Speech Emotion Recognition using Dual-Conv2D architecture
    Ayadi, Souha
    Lachiri, Zied
    PRZEGLAD ELEKTROTECHNICZNY, 2024, 100 (06): : 209 - 211
  • [46] A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition
    De Lope, Javier
    Grana, Manuel
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2022, 32 (06)
  • [47] Speech emotion recognition based on emotion perception
    Gang Liu
    Shifang Cai
    Ce Wang
    EURASIP Journal on Audio, Speech, and Music Processing, 2023
  • [48] Speech emotion recognition based on emotion perception
    Liu, Gang
    Cai, Shifang
    Wang, Ce
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2023, 2023 (01)
  • [49] Autoencoder With Emotion Embedding for Speech Emotion Recognition
    Zhang, Chenghao
    Xue, Lei
    IEEE ACCESS, 2021, 9 : 51231 - 51241
  • [50] Multichannel CNN-BLSTM Architecture for Speech Emotion Recognition System by Fusion of Magnitude and Phase Spectral Features Using DCCA for Consumer Applications
    Prabhakar, Gudmalwar Ashishkumar
    Basel, Biplove
    Dutta, Anirban
    Rao, Ch. V. Rama
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2023, 69 (02) : 226 - 235