Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition

Cited by: 5

Authors:

Liu, Yang [1]

Sun, Haoqin [1]

Guan, Wenbo [1]

Xia, Yuqi [1]

Zhao, Zhen [1]

Affiliations:

[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China

Source:

INTERSPEECH 2022 | 2022

Keywords:

Speech Emotion Recognition; Three-channel Features; Cascaded Attention Network; Adversarial Joint Loss

DOI:

10.21437/Interspeech.2022-11480

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Accurately recognizing emotion from speech is a necessary yet challenging task due to its complexity. A common problem in most previous studies is that certain emotions are severely misclassified. In this paper, we propose a novel framework that integrates cascaded attention and an adversarial joint loss for speech emotion recognition, aiming to resolve these confusions by placing more emphasis on the emotions that are difficult to classify correctly. Specifically, we propose a cascaded attention network to extract effective emotional features, in which spatiotemporal attention selectively locates the targeted emotional regions in the input features. Within these targeted regions, self-attention with head fusion captures the long-distance dependencies of temporal features. Furthermore, an adversarial joint loss strategy is proposed to distinguish emotional embeddings with high similarity by means of hard triplets generated in an adversarial fashion. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method achieves absolute improvements of 3.17% and 0.39% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
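The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the convention common in speech emotion recognition work, where WA is the overall fraction of correctly classified utterances and UA is the mean of the per-class recalls; the abstract itself does not spell out the definitions. The function names and the toy labels below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: an imbalanced set where the minority class is missed.
y_true = ["neu"] * 6 + ["sad"] * 2 + ["ang"] * 2
y_pred = ["neu"] * 6 + ["sad", "neu"] + ["neu", "neu"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 10 correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.0
print(f"WA={wa:.2f} UA={ua:.2f}")
```

In this toy case WA stays relatively high because the majority class dominates, while UA drops because one class is never recognized. That gap is one way to see why frequently confused emotions matter for the reported numbers.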


Pages: 4750-4754

Page count: 5
