Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech

Authors
Leem, Seong-Gyun [1]
Fulford, Daniel [2]
Onnela, Jukka-Pekka [3]
Gard, David [4]
Busso, Carlos [1]
Affiliations
[1] Univ Texas Dallas, Dept Elect & Comp Engn, Richardson, TX 75080 USA
[2] Boston Univ, Occupat Therapy & Psychol & Brain Sci, Boston, MA 02215 USA
[3] Harvard Univ, Harvard TH Chan Sch Publ Hlth, Dept Biostat, Cambridge, MA 02138 USA
[4] San Francisco State Univ, Dept Psychol, San Francisco, CA 94132 USA
Funding
National Institutes of Health (NIH), USA;
Keywords
Speech enhancement; Noise measurement; Speech recognition; Task analysis; Acoustics; Recording; Training; Feature selection; noisy speech; speech enhancement; speech emotion recognition; MODEL; CORPUS;
DOI
10.1109/TASLP.2023.3340603
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
A speech emotion recognition (SER) system deployed in a real-world application can encounter speech contaminated with unconstrained background noise. To deal with this issue, a speech enhancement (SE) module can be attached to the SER system to compensate for the environmental differences of an input. Although the SE module can improve the quality and intelligibility of a given speech signal, it risks altering discriminative acoustic features for SER that are already resilient to environmental differences. Exploring this idea, we propose to enhance only the weak features that degrade emotion recognition performance. Our model first identifies weak feature sets by using multiple models, each trained on clean speech with one acoustic feature at a time. After training the single-feature models, we rank each speech feature according to three criteria: performance, robustness, and a joint ranking that combines performance and robustness. We form candidate weak feature sets by cumulatively adding features from the bottom to the top of each ranking. Once the weak feature set is defined, we enhance only those weak features, keeping the resilient features unchanged. We implement these ideas with low-level descriptors (LLDs). We show that directly enhancing the weak LLDs leads to better performance than extracting LLDs from an enhanced speech signal. Our experiments with clean and noisy versions of the MSP-Podcast corpus show that the proposed approach yields performance gains of 17.7% (arousal), 21.2% (dominance), and 3.3% (valence) over a system that enhances all the LLDs under the 10 dB signal-to-noise ratio (SNR) condition.
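The selection procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the joint ranking combines performance and robustness, so the sketch assumes the sum of the two rank positions. The feature names, the scores, and the `enhance` callback are hypothetical placeholders, not values or interfaces from the paper.

```python
from typing import Callable, Dict, List, Set


def rank_features(scores: Dict[str, float]) -> List[str]:
    """Order feature names from best (highest score) to worst."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def joint_ranking(performance: Dict[str, float],
                  robustness: Dict[str, float]) -> List[str]:
    """Combine two rankings by summing rank positions (an assumed rule)."""
    perf_pos = {name: i for i, name in enumerate(rank_features(performance))}
    rob_pos = {name: i for i, name in enumerate(rank_features(robustness))}
    # Ties in the summed position are broken by the performance rank.
    return sorted(performance,
                  key=lambda name: (perf_pos[name] + rob_pos[name], perf_pos[name]))


def cumulative_weak_sets(ranking: List[str]) -> List[List[str]]:
    """Grow candidate weak sets from the bottom of a ranking toward the top."""
    bottom_up = ranking[::-1]
    return [bottom_up[:k] for k in range(1, len(bottom_up) + 1)]


def selectively_enhance(frame: Dict[str, float],
                        weak: Set[str],
                        enhance: Callable[[str, float], float]) -> Dict[str, float]:
    """Enhance only the weak features; resilient features pass through unchanged."""
    return {name: (enhance(name, value) if name in weak else value)
            for name, value in frame.items()}


# Placeholder scores for four hypothetical LLDs (not results from the paper).
performance = {"f0": 0.60, "loudness": 0.55, "mfcc1": 0.40, "jitter": 0.20}
robustness = {"f0": 0.30, "loudness": 0.80, "mfcc1": 0.50, "jitter": 0.10}

ranking = joint_ranking(performance, robustness)
candidates = cumulative_weak_sets(ranking)
```

In this sketch, each entry of `candidates` is one candidate weak set. A full system would evaluate the recognizer with each candidate and keep the set that gives the best result, then pass only those features to the enhancement model through `selectively_enhance`.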
Pages: 917-929
Page count: 13