Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition

Citations: 4
Authors
Shi, Hao [1]
Mimura, Masato [1]
Kawahara, Tatsuya [1]
Affiliations
[1] Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan
Keywords
Speech enhancement; robust automatic speech recognition (ASR); time-frequency hybrid model; spectral information refining; FRAMEWORK;
DOI
10.1109/TASLP.2024.3407511
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
While waveform-domain speech enhancement (SE) has been extensively investigated in recent years and achieves state-of-the-art performance on many datasets, spectrogram-based SE tends to show robust and stable enhancement behavior. In this paper, we propose a waveform-spectrogram hybrid method (WaveSpecEnc) to improve the robustness of waveform-domain SE. WaveSpecEnc refines the corresponding temporal feature map with a spectrogram encoding in each encoder layer. Incorporating spectral information yields robust perceptual quality for human listeners, but only a minor improvement in automatic speech recognition (ASR). We therefore extend the method for robust ASR by supplying the spectrogram encoding information to both the SE front-end and the ASR back-end (WaveSpecEnc+). Experimental results on the CHiME-4 dataset show that the proposed method consistently improves ASR performance on the real evaluation sets and outperforms other methods, including DEMUCS and Conv-TasNet. Refining in the shallow encoder layers is particularly effective, and the effect is confirmed even with a strong ASR baseline using WavLM.
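The idea of refining a temporal feature map with a spectrogram encoding can be illustrated with a rough NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the random weights, the single framed "encoder layer", and in particular the choice of a sigmoid gate as the refinement step. The abstract does not specify how WaveSpecEnc combines the two representations.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames: (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def magnitude_spectrogram(x, frame_len, hop):
    """Hann-windowed magnitude STFT: (n_frames, frame_len // 2 + 1)."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def temporal_features(x, kernels, hop):
    """One strided 1-D convolution + ReLU on the raw waveform.

    kernels has shape (channels, kernel_len); output is (n_frames, channels).
    """
    frames = frame_signal(x, kernels.shape[1], hop)
    return np.maximum(frames @ kernels.T, 0.0)

def refine(temporal, spec, proj):
    """Hypothetical refinement: gate temporal features with a spectral encoding.

    The log-magnitude spectrogram is projected to the channel dimension and
    squashed to (0, 1); this gating is an assumed mechanism, not the paper's.
    """
    gate = 1.0 / (1.0 + np.exp(-(np.log1p(spec) @ proj)))
    return temporal * gate

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)            # 1 s of noise at 16 kHz (toy input)
frame_len, hop, channels = 400, 160, 8       # illustrative sizes only

kernels = rng.standard_normal((channels, frame_len)) * 0.05
proj = rng.standard_normal((frame_len // 2 + 1, channels)) * 0.05

spec = magnitude_spectrogram(wave, frame_len, hop)
feat = temporal_features(wave, kernels, hop)
refined = refine(feat, spec, proj)
print(spec.shape, feat.shape, refined.shape)
```

Using the same frame length and hop for both branches keeps the spectrogram frames aligned with the temporal feature frames, which is what lets the gate be applied element-wise per frame and channel.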
Pages: 3049-3060
Page count: 12