Hybrid CNN-BiLSTM architecture with multiple attention mechanisms to enhance speech emotion recognition

Cited by: 0
Authors
Poorna, S. S. [1]
Menon, Vivek [2]
Gopalan, Sundararaman [1]
Affiliations
[1] Amrita Vishwa Vidyapeetham, Dept Elect & Commun Engn, Amritapuri, India
[2] Amrita Vishwa Vidyapeetham, Dept Comp Sci & Engn, Amrita Sch Comp, Amritapuri, India
Keywords
SER; CNN; BiLSTM; Mel spectrograms; MFCC; Time-frequency attention; CONVOLUTIONAL NEURAL-NETWORKS; 2D CNN; FEATURES; RECURRENT; REPRESENTATIONS; DATABASES; MODEL
DOI
10.1016/j.bspc.2024.106967
Chinese Library Classification (CLC) number
R318 [Biomedical Engineering]
Discipline classification code
0831
Abstract
In recent years, attention mechanisms in deep learning have been increasingly used to boost the performance of Speech Emotion Recognition (SER) models. However, these models exhibit shortcomings in jointly emphasizing time-frequency and dynamic sequential variations, often under-utilizing contextual emotion-related information. We propose a hybrid deep learning model for SER using Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory Networks (BiLSTM) with multiple attention mechanisms. Our model uses features derived from the speech waveform, namely Mel spectrograms and Mel Frequency Cepstral Coefficients (MFCC), along with their time derivatives, as input to the CNN and BiLSTM modules, respectively. A Time-Frequency Attention (TFA) mechanism, optimally incorporated into the CNN, helps selectively focus on emotion-related energy-time-frequency variations in Mel spectrograms. The attention BiLSTM uses MFCC and its time derivatives to identify the positional information of emotion, addressing the dynamic sequential variations. Finally, we fuse the attention-learned features from the CNN and BiLSTM modules and feed them to a Deep Neural Network (DNN) for SER. The experiments were carried out on three different datasets: Emo-DB and IEMOCAP, which are public datasets, and Amritaemo_Arabic, a private dataset. The hybrid model exhibited superior performance on both the public and private datasets, achieving average SER accuracies of 94.62%, 67.85%, and 95.80% on the Emo-DB, IEMOCAP, and Amritaemo_Arabic datasets, respectively, and outperforming several state-of-the-art models.
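The sequential branch and the fusion step described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the attention formulation, so a generic dot-product soft attention is assumed here, the central-difference `delta` is only one simple way to approximate a time derivative, and all names (`delta`, `attention_pool`, `fuse`, the toy vectors and the scoring weights) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def delta(frames):
    """Approximate the time derivative of per-frame features.

    Assumption: a central difference with the edge frames repeated.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def attention_pool(frames, score_weights):
    """Generic soft attention over time (an assumed form, not the paper's).

    Each frame gets a scalar score (dot product with score_weights),
    scores are normalized with softmax, and frames are averaged with
    those weights.
    """
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


def fuse(cnn_features, sequence_features):
    """Fuse the two branches by concatenation (assumed fusion strategy)."""
    return list(cnn_features) + list(sequence_features)


# Toy data: 3 frames of 2 hypothetical MFCC coefficients each.
mfcc = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0]]
# Sequence input: each frame's MFCCs followed by their time derivatives.
seq_input = [m + d for m, d in zip(mfcc, delta(mfcc))]
# Untrained, hand-picked scoring weights; a real model would learn these.
pooled, alphas = attention_pool(seq_input, [0.1, 0.0, 0.5, 0.0])
# Stand-in for attention-weighted CNN features from the Mel spectrogram branch.
cnn_features = [0.2, 0.7, 0.1]
fused = fuse(cnn_features, pooled)
```

In a full model the frames passed to `attention_pool` would be BiLSTM hidden states rather than raw features, and `fused` would be the input to the DNN classifier; the sketch only shows how attention weights over time and branch fusion fit together.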
Pages: 13