Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

Cited by: 1

Authors:

Liu, Ke [1]

Wei, Jiwei [1]

Zou, Jie [1]

Wang, Peng [1]

Yang, Yang [1,2]

Shen, Heng Tao [1,3]

Affiliations:

[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China

[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China

[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

China Postdoctoral Science Foundation; National Natural Science Foundation of China;

Keywords:

Feature extraction; Mel frequency cepstral coefficient; Task analysis; Speech recognition; Emotion recognition; Representation learning; Fuses; Multi-view learning; discriminative learning; pre-trained model; MFCC; speech emotion recognition; MULTIVIEW; CLASSIFICATION; ATTENTION;

DOI:

10.1109/TMM.2024.3410133

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention in the last two years and shows great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, which increases complexity while yielding limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature, the mel-scale frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves a weighted accuracy (WA) and unweighted accuracy (UA) of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
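
The abstract reports results as WA and UA. As a minimal sketch of how these two metrics are conventionally computed in SER, assuming the usual definitions (WA is overall accuracy across all utterances; UA is the mean of per-class recalls), the following may help in reading the numbers. The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so every class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, made-up labels (hypothetical, for illustration only).
y_true = ["ang", "ang", "ang", "ang", "sad", "neu"]
y_pred = ["ang", "ang", "ang", "sad", "sad", "ang"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA={wa:.4f} UA={ua:.4f}")
```

On imbalanced data the two values diverge, because WA is dominated by the frequent classes while UA treats each emotion class equally.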

Pages: 10623-10636

Number of pages: 14
