Multi-type features separating fusion learning for Speech Emotion Recognition

Cited by: 15
Authors
Xu, Xinlei [1 ,2 ]
Li, Dongdong [2 ]
Zhou, Yijun [2 ]
Wang, Zhe [1 ,2 ]
Affiliations
[1] East China Univ Sci & Technol, Key Lab Smart Mfg Energy Chem Proc, Minist Educ, Shanghai 200237, Peoples R China
[2] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Hybrid feature selection; Feature-level fusion; Speaker-independent; CONVOLUTIONAL NEURAL-NETWORKS; GMM; REPRESENTATIONS; CLASSIFICATION; ADAPTATION; RECURRENT; CNN;
DOI
10.1016/j.asoc.2022.109648
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech Emotion Recognition (SER) is a challenging task for improving human-computer interaction. Speech data have different representations, and choosing the appropriate features to express the emotion behind speech is difficult. The human brain can judge the same thing comprehensively across representations of different dimensions to reach a final conclusion. Inspired by this, we believe that different representations of speech data can offer complementary advantages. Therefore, a Hybrid Deep Learning with Multi-type features Model (HD-MFM) is proposed to integrate the acoustic, temporal, and image information of speech. Specifically, we use a Convolutional Neural Network (CNN) to extract image information from the spectrogram of speech, a Deep Neural Network (DNN) to extract acoustic information from the statistical features of speech, and a Long Short-Term Memory (LSTM) network to extract temporal information from the Mel-Frequency Cepstral Coefficients (MFCC) of speech. Finally, the three types of speech features are concatenated to obtain a richer emotion representation with better discriminative properties. Because different fusion strategies affect the relationships between features, we investigate two fusion strategies in this paper, named separating and merging. To evaluate the feasibility and effectiveness of the proposed HD-MFM, we perform extensive experiments on the EMO-DB and IEMOCAP SER corpora. The experimental results show that the separating strategy has more significant advantages in feature complementarity. The proposed HD-MFM achieves results of 91.25% on EMO-DB and 72.02% on IEMOCAP. These results indicate that the proposed HD-MFM can make full use of complementary feature representations through the separating strategy to further enhance SER performance. (c) 2022 Elsevier B.V. All rights reserved.
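To make the three-branch design named in the abstract concrete, the sketch below wires a CNN over a spectrogram, a DNN over utterance-level statistical features, and an LSTM over an MFCC sequence, then concatenates the branch outputs for feature-level fusion. This is a minimal illustration under assumptions, not the authors' code: the layer sizes, input dimensions (128x128 spectrograms, 384 statistical features, 39 MFCCs), and the choice of PyTorch are all mine, and the separating/merging training strategies compared in the paper are not reproduced here.

```python
# Minimal sketch of the HD-MFM idea from the abstract (hypothetical
# dimensions throughout; not the published implementation).
import torch
import torch.nn as nn

class HDMFMSketch(nn.Module):
    def __init__(self, n_stat_feats=384, n_mfcc=39, n_classes=7):
        super().__init__()
        # CNN branch: image information from the spectrogram (1 x H x W).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (B, 32)
        )
        # DNN branch: acoustic information from statistical features.
        self.dnn = nn.Sequential(
            nn.Linear(n_stat_feats, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),              # -> (B, 64)
        )
        # LSTM branch: temporal information from the MFCC frame sequence.
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=64, batch_first=True)
        # Feature-level fusion: concatenate the three representations.
        self.classifier = nn.Linear(32 + 64 + 64, n_classes)

    def forward(self, spec, stats, mfcc):
        img_feat = self.cnn(spec)                       # (B, 32)
        acou_feat = self.dnn(stats)                     # (B, 64)
        _, (h_n, _) = self.lstm(mfcc)                   # h_n: (1, B, 64)
        temp_feat = h_n[-1]                             # (B, 64)
        fused = torch.cat([img_feat, acou_feat, temp_feat], dim=1)
        return self.classifier(fused)

# Usage with dummy tensors for a batch of 4 utterances.
model = HDMFMSketch()
spec = torch.randn(4, 1, 128, 128)   # spectrogram "images"
stats = torch.randn(4, 384)          # statistical feature vectors
mfcc = torch.randn(4, 100, 39)       # 100-frame MFCC sequences
logits = model(spec, stats, mfcc)    # shape: (4, 7)
```

The concatenated 160-dimensional vector plays the role of the "richer emotion representation" described in the abstract; the linear head maps it to emotion classes (seven here, matching EMO-DB's label set).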
Pages: 13