Modeling Speech Structure to Improve T-F Masks for Speech Enhancement and Recognition

Cited by: 3
Authors
Bu, Suliang [1]
Zhao, Yunxin [1]
Zhao, Tuo [1]
Wang, Shaojun [2]
Han, Mei [2]
Affiliations
[1] Univ Missouri, Dept Elect Engn & Comp Sci, Spoken Language & Informat Proc Lab, Columbia, MO 65211 USA
[2] PAII Inc, Palo Alto, CA USA
Keywords
Speech enhancement; Noise measurement; Speech recognition; Artificial neural networks; Training; Estimation; Spectrogram; Time-frequency masks; beamforming; speech enhancement and recognition; speech region; UNet++; SEPARATION; BINARY; INTELLIGIBILITY; LIKELIHOOD; CHALLENGE; NOISE; TIME
DOI
10.1109/TASLP.2022.3196168
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Time-frequency (TF) masks are widely used in speech enhancement (SE). However, accurately estimating TF masks from noisy speech remains a challenge for both statistical and neural network (NN) approaches. Statistical-model-based mask estimation usually depends on a good parameter initialization, while NN-based methods rely on setting proper and stable learning targets. To address these issues, we propose to extract TF speech structure from clean speech and to partition the noisy speech spectrogram into mutually exclusive regions. We investigate modeling clean speech with utterance-specific narrowband complex Gaussian mixture models to derive the regions, and using the region targets to supervise the training of UNet++, a high-performance NN, to predict regions from noisy speech. For multichannel SE, we consider two scenarios for using speech regions: 1) integrating the regions with TF masks by constraining the mask values or the model parameter updates, and 2) using the predicted regions in place of TF masks. For single-channel SE, we consider using the region targets to improve TF mask targets. Furthermore, we propose to use UNet++ for TF mask estimation. Our experimental results on speech recognition (CHiME-3) and SE (CHiME-3 and LibriSpeech) demonstrate the effectiveness of the proposed approach of modeling speech region structure to improve TF masks for speech recognition and enhancement.
Pages: 2705-2715
Number of pages: 11