Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] University of Electronic Science and Technology of China, Center for Future Media, School of Computer Science and Engineering, Chengdu 611731, People's Republic of China
[2] University of Electronic Science and Technology of China, Institute of Electronic and Information Engineering, Dongguan 523808, People's Republic of China
[3] Peng Cheng Laboratory, Shenzhen 518066, People's Republic of China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
Chinese Library Classification (CLC) number
TP [Automation technology; computer technology]
Discipline classification code
0812
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years and shows great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, increasing complexity for limited performance benefit. In this paper, we improve pre-trained-model-based multi-view SER from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning the effective features hidden in the low-level mel-scale frequency cepstral coefficient (MFCC) feature. We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves weighted accuracy (WA) and unweighted accuracy (UA) of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
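The record above does not include the TsPCA architecture itself; the short PyTorch sketch below illustrates one plausible reading of a two-stream pooling channel attention block, in which average-pooled and max-pooled streams over the time axis jointly produce per-channel weights for MFCC-derived features. The class name TsPCASketch, the shared bottleneck MLP, and the reduction ratio are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a two-stream pooling channel attention block.
# The exact TsPCA design is not given in this record; this follows the common
# average-pool / max-pool two-stream channel-attention pattern as an assumption.
import torch
import torch.nn as nn

class TsPCASketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Shared bottleneck MLP applied to both pooled streams.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) features derived from MFCC.
        avg_stream = self.mlp(x.mean(dim=2))             # average pooling over time
        max_stream = self.mlp(x.amax(dim=2))             # max pooling over time
        weights = self.sigmoid(avg_stream + max_stream)  # per-channel attention weights
        return x * weights.unsqueeze(-1)                 # re-weight the channel dimension

# Usage: re-weight a batch of 64-channel MFCC-derived feature maps of length 300.
features = torch.randn(4, 64, 300)
attended = TsPCASketch(channels=64)(features)

Summing the two pooled streams before the sigmoid is one common way to fuse them; the paper may combine the streams differently.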
Pages: 10623-10636
Number of pages: 14
Related Papers
50 records in total
  • [1] Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
    Zhang, Hua
    Gou, Ruoyun
    Shang, Jili
    Shen, Fangyao
    Wu, Yifan
    Dai, Guojun
    FRONTIERS IN PHYSIOLOGY, 2021, 12
  • [2] Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
    Tran, Minh
    Yin, Yufeng
    Soleymani, Mohammad
    INTERSPEECH 2023, 2023, : 636 - 640
  • [3] A Novel Policy for Pre-trained Deep Reinforcement Learning for Speech Emotion Recognition
    Rajapakshe, Thejan
    Rana, Rajib
    Khalifa, Sara
    Liu, Jiajun
    Schuller, Bjorn
    2022 AUSTRALIAN COMPUTER SCIENCE WEEK (ACSW 2022), 2022, : 96 - 105
  • [4] Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model
    Lee, Jaeyoung
    Mimura, Masato
    Kawahara, Tatsuya
    INTERSPEECH 2023, 2023, : 1394 - 1398
  • [5] Automatic Speech Recognition Dataset Augmentation with Pre-Trained Model and Script
    Kwon, Minsu
    Choi, Ho-Jin
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 649 - 651
  • [6] MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model
    Chen, Zengzhao
    Liu, Chuan
    Wang, Zhifeng
    Zhao, Chuanxu
    Lin, Mengting
    Zheng, Qiuyu
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 273
  • [7] Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
    Girish, K. V. Vijay
    Konjeti, Srikanth
    Vepa, Jithendra
    INTERSPEECH 2022, 2022, : 4496 - 4500
  • [8] Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition
    Slaughter, Isaac
    Greenberg, Craig
    Schwartz, Reva
    Caliskan, Aylin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 8967 - 8989
  • [9] IMPROVING CTC-BASED SPEECH RECOGNITION VIA KNOWLEDGE TRANSFERRING FROM PRE-TRAINED LANGUAGE MODELS
    Deng, Keqi
    Cao, Songjun
    Zhang, Yike
    Ma, Long
    Cheng, Gaofeng
    Xu, Ji
    Zhang, Pengyuan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8517 - 8521
  • [10] FEATURE EXTRACTION USING PRE-TRAINED CONVOLUTIVE BOTTLENECK NETS FOR DYSARTHRIC SPEECH RECOGNITION
    Takashima, Yuki
    Nakashika, Toru
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 1411 - 1415