MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model

Cited: 0
Authors
Chen, Zengzhao [1 ,2 ]
Liu, Chuan [1 ]
Wang, Zhifeng [1 ]
Zhao, Chuanxu [1 ]
Lin, Mengting [1 ]
Zheng, Qiuyu [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Natl Intelligent Soc Governance Expt Base Educ, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-task learning; Speech emotion recognition; Speaker identification; Automatic speech recognition; Speech representation learning;
DOI
10.1016/j.eswa.2025.126855
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
This study proposes a novel Speech Emotion Recognition (SER) approach employing a Multi-Task Learning framework (MTLSER), designed to boost recognition accuracy by training multiple related tasks simultaneously and sharing information via a joint loss function. This framework integrates SER as the primary task, with Automatic Speech Recognition (ASR) and speaker identification serving as auxiliary tasks. Feature extraction is conducted using the pre-trained wav2vec2.0 model, which acts as a shared layer within our multi-task learning (MTL) framework. Extracted features are then processed in parallel by the three tasks. The contributions of auxiliary tasks are adjusted through hyperparameters, and their loss functions are amalgamated into a singular joint loss function for effective backpropagation. This optimization refines the model's internal parameters. Our method's efficacy is tested during the inference stage, where the model concurrently outputs the emotion, textual content, and speaker identity from the input audio. We conducted ablation studies and a sensitivity analysis on the hyperparameters to determine the optimal settings for emotion recognition. The performance of our proposed MTLSER model is evaluated using the public IEMOCAP dataset. Results from extensive testing show a significant improvement over traditional methods, achieving a Weighted Accuracy (WA) of 82.63% and an Unweighted Accuracy (UA) of 82.19%. These findings affirm the effectiveness and robustness of our approach. Our code is publicly available at https://github.com/CCNU-nercel-lc/MTL-SER.
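The architecture the abstract describes can be illustrated with a minimal sketch (this is not the authors' released code; see their GitHub repository for that). A shared encoder, standing in here for the pre-trained wav2vec2.0 model, feeds three parallel heads for SER, ASR, and speaker identification, and the per-task losses are combined into one joint loss with assumed auxiliary-task weights; the head dimensions and the simple linear encoder are placeholders for illustration only.

```python
# Hedged sketch of the multi-task setup: shared encoder + three task heads,
# with auxiliary losses weighted into a single joint loss for backpropagation.
import torch
import torch.nn as nn


class MTLSERSketch(nn.Module):
    def __init__(self, in_dim=161, feat_dim=768, n_emotions=4,
                 n_speakers=10, vocab=32, asr_weight=0.1, sid_weight=0.1):
        super().__init__()
        # Placeholder for the pre-trained wav2vec2.0 shared layer.
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.ser_head = nn.Linear(feat_dim, n_emotions)   # primary task: emotion
        self.asr_head = nn.Linear(feat_dim, vocab)        # auxiliary: frame-level ASR logits
        self.sid_head = nn.Linear(feat_dim, n_speakers)   # auxiliary: speaker identity
        self.asr_weight, self.sid_weight = asr_weight, sid_weight

    def forward(self, x):
        h = self.encoder(x)        # (batch, time, feat_dim) shared representation
        pooled = h.mean(dim=1)     # utterance-level pooling for SER and speaker ID
        return self.ser_head(pooled), self.asr_head(h), self.sid_head(pooled)

    def joint_loss(self, outputs, emo_y, spk_y):
        ser_logits, _asr_logits, sid_logits = outputs
        ce = nn.functional.cross_entropy
        # A real ASR branch would add a CTC loss on the frame-level logits,
        # scaled by asr_weight; it is omitted here to keep the sketch short.
        return ce(ser_logits, emo_y) + self.sid_weight * ce(sid_logits, spk_y)


model = MTLSERSketch()
x = torch.randn(2, 50, 161)   # toy batch: 2 utterances, 50 frames of features
loss = model.joint_loss(model(x), torch.tensor([0, 1]), torch.tensor([3, 7]))
loss.backward()               # one backward pass updates the shared layer and all heads
```

Because the encoder is shared, gradients from both auxiliary heads flow into the same representation the SER head consumes, which is the mechanism by which the auxiliary tasks are intended to improve emotion recognition.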
Pages: 16
Related papers
50 in total
  • [31] JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving
    Zhao, Wayne Xin
    Zhou, Kun
    Zhang, Beichen
    Gong, Zheng
    Chen, Zhipeng
    Zhou, Yuanhang
    Wen, Ji-Rong
    Sha, Jing
    Wang, Shijin
    Liu, Cong
    Hu, Guoping
    PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 5660 - 5672
  • [32] MTLink: Adaptive multi-task learning based pre-trained language model for traceability link recovery between issues and commits
    Deng, Yang
    Wang, Bangchao
    Zhu, Qiang
    Liu, Junping
    Kuang, Jiewen
    Li, Xingfu
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2024, 36 (02)
  • [33] Transformer-based transfer learning and multi-task learning for improving the performance of speech emotion recognition
    Park, Sunchan
    Kim, Hyung Soon
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2021, 40 (05): : 515 - 522
  • [34] Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition
    Seo, Jiyoung
    Lee, Bowon
    SYMMETRY-BASEL, 2022, 14 (07):
  • [35] When to Use Multi-Task Learning vs Intermediate Fine-Tuning for Pre-Trained Encoder Transfer Learning
    Weller, Orion
    Seppi, Kevin
    Gardner, Matt
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): (SHORT PAPERS), VOL 2, 2022, : 272 - 282
  • [36] Automatic Speech Recognition Dataset Augmentation with Pre-Trained Model and Script
    Kwon, Minsu
    Choi, Ho-Jin
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 649 - 651
  • [37] Cross-Corpus Speech Emotion Recognition Based on Multi-Task Learning and Subdomain Adaptation
    Fu, Hongliang
    Zhuang, Zhihao
    Wang, Yang
    Huang, Chen
    Duan, Wenzhuo
    ENTROPY, 2023, 25 (01)
  • [38] Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning
    Kim, Jaebok
    Englebienne, Gwenn
    Truong, Khiet P.
    Evers, Vanessa
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1113 - 1117
  • [39] MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders Are Better Dense Retrievers
    Zhou, Kun
    Liu, Xiao
    Gong, Yeyun
    Zhao, Wayne Xin
    Jiang, Daxin
    Duan, Nan
    Wen, Ji-Rong
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT II, 2023, 14170 : 630 - 647
  • [40] Multi-task Recurrent Model for Speech and Speaker Recognition
    Tang, Zhiyuan
    Li, Lantian
    Wang, Dong
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,