Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1 ]
Wei, Jiwei [1 ]
Zou, Jie [1 ]
Wang, Peng [1 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years, showing great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, which increases complexity with limited performance benefit. In this paper, we improve pre-trained-model-based multi-view SER from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level mel-scale frequency cepstral coefficient (MFCC) feature. We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
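The abstract describes TsPCA only at a high level, so the sketch below is a rough, assumption-laden illustration of a two-stream pooling channel attention block over MFCC-derived features: an average-pooled and a max-pooled channel descriptor are each mixed by a shared 1D convolution (giving lightweight inter-channel interaction) and combined into sigmoid channel weights. The class name, kernel size, and the additive fusion of the two streams are guesses for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-stream pooling channel attention block, assuming a
# CBAM/ECA-style design; layer sizes and the stream fusion are assumptions.
import torch
import torch.nn as nn


class TwoStreamPoolingChannelAttention(nn.Module):
    """Re-weights the channel dimension of MFCC-derived features (B, C, T)."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # A single shared 1D convolution mixes information across neighbouring
        # channels for both pooling streams.
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) features derived from MFCC
        avg_pool = x.mean(dim=2)                      # (B, C) average-pooled stream
        max_pool = x.amax(dim=2)                      # (B, C) max-pooled stream
        # Treat each channel descriptor as a length-C sequence so the shared
        # convolution performs inter-channel interaction.
        avg_att = self.conv(avg_pool.unsqueeze(1))    # (B, 1, C)
        max_att = self.conv(max_pool.unsqueeze(1))    # (B, 1, C)
        weights = self.sigmoid(avg_att + max_att).squeeze(1)  # (B, C) channel weights
        return x * weights.unsqueeze(2)               # discriminatively weighted features


# Usage with a dummy MFCC-like tensor: 4 utterances, 64 channels, 300 frames.
if __name__ == "__main__":
    feats = torch.randn(4, 64, 300)
    tspca = TwoStreamPoolingChannelAttention(kernel_size=3)
    print(tspca(feats).shape)  # torch.Size([4, 64, 300])
```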
Pages: 10623-10636
Number of pages: 14
Related Papers
50 items in total
  • [21] Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications
    Tamm, Bastiaan
    Balabin, Helena
    Vandenberghe, Rik
    Van Hamme, Hugo
    INTERSPEECH 2022, 2022, : 4083 - 4087
  • [22] GENERATING HUMAN READABLE TRANSCRIPT FOR AUTOMATIC SPEECH RECOGNITION WITH PRE-TRAINED LANGUAGE MODEL
    Liao, Junwei
    Shi, Yu
    Gong, Ming
    Shou, Linjun
    Eskimez, Sefik
    Lu, Liyang
    Qu, Hong
    Zeng, Michael
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7578 - 7582
  • [23] ON THE USE OF SELF-SUPERVISED PRE-TRAINED ACOUSTIC AND LINGUISTIC FEATURES FOR CONTINUOUS SPEECH EMOTION RECOGNITION
    Macary, Manon
    Tahon, Marie
    Esteve, Yannick
    Rousseau, Anthony
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 373 - 380
  • [24] PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
    Feng, Tiantian
    Narayanan, Shrikanth
    2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, ACII, 2023,
  • [25] Speech emotion recognition based on syllable-level feature extraction
    Rehman, Abdul
    Liu, Zhen-Tao
    Wu, Min
    Cao, Wei-Hua
    Jiang, Cheng-Shan
    APPLIED ACOUSTICS, 2023, 211
  • [27] An autoencoder-based feature level fusion for speech emotion recognition
    Peng, Shixin
    Chen, Kai
    Tian, Tian
    Chen, Jingying
    DIGITAL COMMUNICATIONS AND NETWORKS, 2024, 10 (05) : 1341 - 1351
  • [28] On the Usage of Pre-Trained Speech Recognition Deep Layers to Detect Emotions
    Oliveira, Jorge
    Praca, Isabel
    IEEE ACCESS, 2021, 9 : 9699 - 9705
  • [29] Leveraging Pre-trained Language Model for Speech Sentiment Analysis
    Shon, Suwon
    Brusco, Pablo
    Pan, Jing
    Han, Kyu J.
    Watanabe, Shinji
    INTERSPEECH 2021, 2021, : 3420 - 3424
  • [30] How to Estimate Model Transferability of Pre-Trained Speech Models?
    Chen, Zih-Ching
    Yang, Chao-Han Huck
    Li, Bo
    Zhang, Yu
    Chen, Nanxin
    Chang, Shou-Yiin
    Prabhavalkar, Rohit
    Lee, Hung-yi
    Sainath, Tara N.
    INTERSPEECH 2023, 2023, : 456 - 460