Speech emotion recognition with embedded attention mechanism and hierarchical context

Cited by: 0
Authors
Cheng Y. [1 ]
Chen Y. [2 ]
Chen Y. [2 ]
Yang Y. [1 ]
Affiliations
[1] School of Computer Science and Technology, Wuhan University of Technology, Wuhan
[2] School of Computer, Hubei University of Technology, Wuhan
Keywords
Attention mechanism; BLSTM; Context; Speech emotion recognition
DOI
10.11918/j.issn.0367-6234.201905193
Abstract
Speech emotion recognition remains a challenging task due to issues such as the construction of emotional corpora, the association between emotion and acoustic features, and the modeling of speech emotion. Conventional context-based speech emotion recognition systems are confined to the feature layer, and therefore risk losing the context details of the label layer and neglecting the difference between the two levels. This paper proposes a Bidirectional Long Short-Term Memory (BLSTM) network with an embedded attention mechanism, combined with a hierarchical context learning model. The model completes the speech emotion recognition task in three phases. In the first phase, a feature set is extracted from the emotional speech, the SVM-RFE feature-ranking algorithm reduces the features to obtain an optimal feature subset, and attention weights are assigned. In the second phase, the weighted feature subset is fed into a BLSTM network that learns feature-layer context and produces an initial emotion prediction. In the third phase, the emotion values are used to train another, independent BLSTM network that learns label-layer context information; based on this information, the final prediction is derived from the output of the second phase. By embedding the attention mechanism, the model automatically learns to adjust its attention over the input feature subset; by introducing the label-layer context and associating it with the feature-layer context, it achieves hierarchical context fusion, improves robustness, and strengthens the model's ability to model emotional speech. Experimental results on the SEMAINE and RECOLA datasets show that both RMSE and CCC improve significantly over the baseline model. © 2019, Editorial Board of Journal of Harbin Institute of Technology. All rights reserved.
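The abstract outlines a concrete three-phase architecture. Below is a minimal sketch of that pipeline, assuming scikit-learn for the SVM-RFE ranking step and PyTorch for the networks; the class names, the 40-feature subset size, the hidden width, and the end-to-end wiring are illustrative assumptions rather than the authors' implementation (in particular, the paper trains the label-layer BLSTM independently, whereas this sketch chains the phases for brevity).

```python
# Minimal sketch of the three-phase pipeline described in the abstract.
# All names, dimensions, and the wiring are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.svm import SVR
from sklearn.feature_selection import RFE

# Phase 1a (offline): SVM-RFE ranking with a linear SVR, keeping an
# assumed subset of 40 features from the raw acoustic feature set.
def select_features(X, y, n_keep=40):
    selector = RFE(SVR(kernel="linear"), n_features_to_select=n_keep)
    return selector.fit(X, y).support_          # boolean mask over features

class FeatureAttention(nn.Module):
    """Phase 1b: learnable attention weights over the selected features."""
    def __init__(self, n_features):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                        # x: (batch, time, features)
        return x * torch.softmax(self.scores, dim=0)

class HierarchicalContextSER(nn.Module):
    """Phases 2-3: feature-layer BLSTM, then a label-layer BLSTM whose
    context refines the initial emotion prediction."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.attn = FeatureAttention(n_features)
        self.feature_blstm = nn.LSTM(n_features, hidden,
                                     bidirectional=True, batch_first=True)
        self.to_emotion = nn.Linear(2 * hidden, 1)   # initial prediction
        self.label_blstm = nn.LSTM(1, hidden,
                                   bidirectional=True, batch_first=True)
        self.refine = nn.Linear(2 * hidden + 1, 1)   # fuse both layers

    def forward(self, x):
        h, _ = self.feature_blstm(self.attn(x))      # feature-layer context
        y0 = self.to_emotion(h)                      # phase-2 output
        c, _ = self.label_blstm(y0)                  # label-layer context
        return self.refine(torch.cat([c, y0], -1))   # final prediction

x = torch.randn(8, 100, 40)                # (batch, frames, selected features)
print(HierarchicalContextSER(40)(x).shape)  # torch.Size([8, 100, 1])
```

In this reading, phase 1 maps to select_features plus FeatureAttention, phase 2 to feature_blstm with to_emotion, and phase 3 to label_blstm with refine, which fuses the label-layer context with the phase-2 output to produce the final prediction.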
Pages: 100-107 (7 pages)