MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model

Times Cited: 0
Authors
Chen, Zengzhao [1 ,2 ]
Liu, Chuan [1 ]
Wang, Zhifeng [1 ]
Zhao, Chuanxu [1 ]
Lin, Mengting [1 ]
Zheng, Qiuyu [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Natl Intelligent Soc Governance Expt Base Educ, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-task learning; Speech emotion recognition; Speaker identification; Automatic speech recognition; Speech representation learning;
DOI
10.1016/j.eswa.2025.126855
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
This study proposes a novel Speech Emotion Recognition (SER) approach built on a Multi-Task Learning framework (MTLSER), designed to boost recognition accuracy by training multiple related tasks simultaneously and sharing information via a joint loss function. The framework integrates SER as the primary task, with Automatic Speech Recognition (ASR) and speaker identification serving as auxiliary tasks. Feature extraction is performed by the pre-trained wav2vec2.0 model, which acts as a shared layer within the multi-task learning (MTL) framework; the extracted features are then processed in parallel by the three task heads. The contributions of the auxiliary tasks are adjusted through hyperparameters, and their loss functions are combined into a single joint loss function for backpropagation, which optimizes the model's internal parameters. At inference, the model concurrently outputs the emotion, textual content, and speaker identity of the input audio. We conducted ablation studies and a sensitivity analysis on the hyperparameters to determine the optimal settings for emotion recognition. The performance of the proposed MTLSER model is evaluated on the public IEMOCAP dataset. Results from extensive testing show a significant improvement over traditional methods, achieving a Weighted Accuracy (WA) of 82.63% and an Unweighted Accuracy (UA) of 82.19%. These findings affirm the effectiveness and robustness of our approach. Our code is publicly available at https://github.com/CCNU-nercel-lc/MTL-SER.
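The joint-loss formulation described in the abstract (auxiliary-task losses scaled by hyperparameters and added to the primary SER loss before backpropagation) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the weight names `alpha_asr` and `beta_sid` are assumed for clarity and are not taken from the paper's notation.

```python
def joint_loss(loss_ser, loss_asr, loss_sid, alpha_asr=0.1, beta_sid=0.1):
    """Combine the primary SER loss with weighted auxiliary-task losses.

    loss_ser  -- primary speech-emotion-recognition loss
    loss_asr  -- auxiliary automatic-speech-recognition loss
    loss_sid  -- auxiliary speaker-identification loss
    alpha_asr, beta_sid -- hyperparameters controlling each auxiliary
                           task's contribution (names are illustrative)
    """
    return loss_ser + alpha_asr * loss_asr + beta_sid * loss_sid


# Example: the primary loss dominates; auxiliary tasks contribute
# only their scaled terms to the total that is backpropagated.
total = joint_loss(1.0, 2.0, 0.5, alpha_asr=0.2, beta_sid=0.1)
print(total)  # 1.0 + 0.2*2.0 + 0.1*0.5 = 1.45
```

In a training loop, `total` would be the single scalar passed to the backward pass, so gradients from all three task heads flow into the shared wav2vec2.0 encoder.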
Pages: 16