Multi-type features separating fusion learning for Speech Emotion Recognition

Cited: 15
Authors
Xu, Xinlei [1 ,2 ]
Li, Dongdong [2 ]
Zhou, Yijun [2 ]
Wang, Zhe [1 ,2 ]
Affiliations
[1] East China Univ Sci Technol, Key Lab Smart Mfg Energy Chem Proc, Minist Educ, Shanghai 200237, Peoples R China
[2] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speech emotion recognition; Hybrid feature selection; Feature-level fusion; Speaker-independent; CONVOLUTIONAL NEURAL-NETWORKS; GMM; REPRESENTATIONS; CLASSIFICATION; ADAPTATION; RECURRENT; CNN;
DOI
10.1016/j.asoc.2022.109648
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech Emotion Recognition (SER) is a challenging task for improving human-computer interaction. Speech data have different representations, and choosing appropriate features to express the emotion behind speech is difficult. The human brain judges the same thing comprehensively across different dimensional representations to reach a final result. Inspired by this, we believe it is reasonable to expect complementary advantages among different representations of speech data. Therefore, a Hybrid Deep Learning with Multi-type features Model (HD-MFM) is proposed to integrate the acoustic, temporal, and image information of speech. Specifically, a Convolutional Neural Network (CNN) extracts image information from the spectrogram of speech, a Deep Neural Network (DNN) extracts acoustic information from the statistical features of speech, and a Long Short-Term Memory (LSTM) network extracts temporal information from the Mel-Frequency Cepstral Coefficients (MFCC) of speech. Finally, the three types of speech features are concatenated to obtain a richer emotion representation with better discriminative properties. Since different fusion strategies affect the relationships between features, two fusion strategies, named separating and merging, are considered in this paper. To evaluate the feasibility and effectiveness of the proposed HD-MFM, we perform extensive experiments on the EMO-DB and IEMOCAP SER corpora. The experimental results show that the separating strategy has a clear advantage in feature complementarity: the proposed HD-MFM achieves 91.25% on EMO-DB and 72.02% on IEMOCAP. These results indicate that, with the separating strategy, HD-MFM makes full use of complementary feature representations to further enhance SER performance. (c) 2022 Elsevier B.V. All rights reserved.
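The record contains no code, but the abstract outlines a concrete three-branch architecture, so a rough sketch may help. Below is a minimal PyTorch illustration of the CNN + DNN + LSTM branches with feature-level concatenation; the class name HDMFMSketch, all layer sizes, and the input dimensions (e.g., 384 utterance-level statistical features, 39-dimensional MFCC frames, 7 emotion classes as in EMO-DB) are illustrative assumptions, not the configuration reported in the paper.

# Minimal sketch (assumed, not the authors' code): three feature-specific
# branches whose embeddings are concatenated for classification.
import torch
import torch.nn as nn

class HDMFMSketch(nn.Module):  # hypothetical name
    def __init__(self, num_classes=7, stat_dim=384, mfcc_dim=39):
        super().__init__()
        # CNN branch: image information from the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),           # -> (B, 32)
        )
        # DNN branch: acoustic information from statistical features.
        self.dnn = nn.Sequential(
            nn.Linear(stat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),                   # -> (B, 64)
        )
        # LSTM branch: temporal information from MFCC frames.
        self.lstm = nn.LSTM(mfcc_dim, 64, batch_first=True)  # -> (B, 64)
        # Feature-level fusion: concatenate the three embeddings.
        self.classifier = nn.Linear(32 + 64 + 64, num_classes)

    def forward(self, spec, stats, mfcc):
        h_img = self.cnn(spec)              # spec:  (B, 1, H, W)
        h_stat = self.dnn(stats)           # stats: (B, stat_dim)
        _, (h_n, _) = self.lstm(mfcc)      # mfcc:  (B, T, mfcc_dim)
        fused = torch.cat([h_img, h_stat, h_n[-1]], dim=1)
        return self.classifier(fused)

# Usage with dummy tensors (batch of 2):
model = HDMFMSketch()
logits = model(torch.randn(2, 1, 128, 256),   # spectrogram image
               torch.randn(2, 384),           # utterance statistics
               torch.randn(2, 300, 39))       # MFCC sequence

Note that this sketch only approximates the separating-style fusion: each branch encodes one feature type independently, and the embeddings meet only at the final classifier. The paper's exact separating and merging procedures are not detailed in this record.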
Pages: 13