Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention over the last two years, showing great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, increasing complexity for limited performance benefit. In this paper, we improve pre-trained-model-based multi-view SER from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning the effective features hidden in a low-level speech feature, the mel-frequency cepstral coefficients (MFCC). We propose a two-stream pooling channel attention (TsPCA) module that discriminatively weights the channel dimensions of the features derived from MFCC, enabling inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In comparison experiments, our method achieves weighted accuracy (WA) and unweighted accuracy (UA) of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
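Note: The abstract names a two-stream pooling channel attention (TsPCA) module but does not detail its internals. Below is a minimal PyTorch sketch of the general idea, assuming (our assumption, not the authors' published design) that the two streams are average and max pooling over the time axis, fused into per-channel weights in the style of SE/CBAM-like channel attention; the class name, reduction ratio, and tensor shapes are illustrative placeholders.

import torch
import torch.nn as nn

class TsPCA(nn.Module):
    """Hypothetical sketch of a two-stream pooling channel attention block.

    Input:  (batch, channels, time) features derived from MFCC.
    Output: same shape, re-weighted along the channel dimension.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Shared bottleneck MLP: lets channels interact with one another
        # before the per-channel attention weights are produced.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stream 1: average-pool over time -> (batch, channels).
        avg = self.mlp(x.mean(dim=-1))
        # Stream 2: max-pool over time -> (batch, channels).
        mx = self.mlp(x.max(dim=-1).values)
        # Fuse both streams into per-channel attention weights in (0, 1).
        weights = torch.sigmoid(avg + mx).unsqueeze(-1)
        return x * weights

# Usage sketch: a batch of 8 utterances, 40 MFCC coefficients treated as
# channels, 300 frames. These shapes are placeholders, not the paper's setup.
feats = torch.randn(8, 40, 300)
out = TsPCA(channels=40)(feats)
assert out.shape == feats.shape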
Pages: 10623-10636
Number of pages: 14