Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition

Cited by: 4

Authors:

Shi, Hao [1]

Mimura, Masato [1]

Kawahara, Tatsuya [1]

Affiliations:

[1] Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Speech enhancement; robust automatic speech recognition (ASR); time-frequency hybrid model; spectral information refining; FRAMEWORK;

DOI:

10.1109/TASLP.2024.3407511

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

While waveform-domain speech enhancement (SE) has been extensively investigated in recent years and achieves state-of-the-art performance on many datasets, spectrogram-based SE tends to show robust and stable enhancement behavior. In this paper, we propose a waveform-spectrogram hybrid method (WaveSpecEnc) to improve the robustness of waveform-domain SE. WaveSpecEnc refines the corresponding temporal feature map by spectrogram encoding in each encoder layer. Incorporating spectral information yields robust perceptual quality for human listeners, but it brings only a minor improvement in automatic speech recognition (ASR). Thus, we improve it for robust ASR by further supplying the spectrogram encoding information to both the SE front-end and the ASR back-end (WaveSpecEnc+). Experimental results on the CHiME-4 dataset show that ASR performance on the real evaluation sets is consistently improved with the proposed method, which outperforms other methods, including DEMUCS and Conv-TasNet. Refining in the shallow encoder layers is very effective, and the effect is confirmed even with a strong ASR baseline using WavLM.

Pages: 3049 - 3060

Page count: 12
