Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1
Authors
Liu, Ke [1]
Wei, Jiwei [1]
Zou, Jie [1]
Wang, Peng [1]
Yang, Yang [1, 2]
Shen, Heng Tao [1, 3]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;
DOI
10.1109/TMM.2024.3410133
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years and shows great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, which increases complexity for limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature, the mel-scale frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
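The abstract reports results as WA and UA without defining them in this record. A minimal sketch follows, assuming the convention common in SER work: WA (weighted accuracy) is the overall fraction of correctly classified utterances, and UA (unweighted accuracy) is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy labels, for illustration only.
y_true = ["angry", "angry", "angry", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "happy", "happy", "sad", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Under this assumed definition the two scores diverge when classes are imbalanced, which may be why records like this one list both for each dataset.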
Pages: 10623-10636
Page count: 14
Related papers
50 in total
  • [41] End-to-end speech topic classification based on pre-trained model Wavlm
    Cao, Tengfei
    He, Liang
    Niu, Fangjing
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 369 - 373
  • [42] Chinese cyber-violent Speech Detection and Analysis Based on Pre-trained Model
    Zhou, Sunrui
    2024 5TH INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKS AND INTERNET OF THINGS, CNIOT 2024, 2024, : 443 - 447
  • [43] Comparing Pre-Trained Language Model for Arabic Hate Speech Detection
    Daouadi, Kheir Eddine
    Boualleg, Yaakoub
    Guehairia, Oussama
    COMPUTACION Y SISTEMAS, 2024, 28 (02): : 681 - 693
  • [44] Automatic Prosody Annotation with Pre-Trained Text-Speech Model
    Dai, Ziqian
    Yu, Jianwei
    Wang, Yan
    Chen, Nuo
    Bian, Yanyao
    Li, Guangzhi
    Cai, Deng
    Yu, Dong
    INTERSPEECH 2022, 2022, : 5513 - 5517
  • [45] Pre-trained Model Based Feature Envy Detection
    Ma, Wenhao
    Yu, Yaoxiang
    Ruan, Xiaoming
    Cai, Bo
    2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2023, : 430 - 440
  • [46] IMPROVING NON-AUTOREGRESSIVE END-TO-END SPEECH RECOGNITION WITH PRE-TRAINED ACOUSTIC AND LANGUAGE MODELS
    Deng, Keqi
    Yang, Zehui
    Watanabe, Shinji
    Higuchi, Yosuke
    Cheng, Gaofeng
    Zhang, Pengyuan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8522 - 8526
  • [47] Speech Emotion Recognition based on Multiple Feature Fusion
    Jiang, Changjiang
    Mao, Rong
    Liu, Geng
    Wang, Mingyi
    2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 907 - 912
  • [48] Speech emotion recognition based on time domain feature
    Zhao, Lasheng
    Wei, Xiaopeng
    Zhang, Qiang
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE INFORMATION COMPUTING AND AUTOMATION, VOLS 1-3, 2008, : 1319 - 1321
  • [49] Emotion Recognition from Speech based on Relevant Feature and Majority Voting
    Sarker, Md Kamruzzaman
    Alam, Kazi Md Rokibul
    Arifuzzaman, Md
    2014 INTERNATIONAL CONFERENCE ON INFORMATICS, ELECTRONICS & VISION (ICIEV), 2014,
  • [50] Adapting Large-Scale Pre-trained Models for Unified Dialect Speech Recognition Model
    Toyama, T.
    Kai, A.
    Kamiya, Y.
    Takahashi, N.
    Acta Physica Polonica A, 2024, 146 (04) : 413 - 418