Modeling Speech Structure to Improve T-F Masks for Speech Enhancement and Recognition

Cited by: 3
Authors
Bu, Suliang [1]
Zhao, Yunxin [1]
Zhao, Tuo [1]
Wang, Shaojun [2]
Han, Mei [2]
Affiliations
[1] Univ Missouri, Dept Elect Engn & Comp Sci, Spoken Language & Informat Proc Lab, Columbia, MO 65211 USA
[2] PAII Inc, Palo Alto, CA USA
Keywords
Speech enhancement; Noise measurement; Speech recognition; Artificial neural networks; Training; Estimation; Spectrogram; Time-frequency masks; beamforming; speech enhancement and recognition; speech region; UNet++; SEPARATION; BINARY; INTELLIGIBILITY; LIKELIHOOD; CHALLENGE; NOISE; TIME
DOI
10.1109/TASLP.2022.3196168
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Time-frequency (TF) masks are widely used in speech enhancement (SE). However, accurately estimating TF masks from noisy speech remains a challenge for both statistical and neural network (NN) approaches. Statistical model-based mask estimation usually depends on a good parameter initialization, while NN-based methods rely on setting proper and stable learning targets. To address these issues, we propose to extract TF speech structure from clean speech and partition the noisy speech spectrogram into mutually exclusive regions. We investigate modeling clean speech by utterance-specific narrowband complex Gaussian mixture models to derive the regions, and using the region targets to supervise the training of UNet++, a high-performance NN, for predicting regions from noisy speech. For multichannel SE, we consider two scenarios of using speech regions: 1) integrating the regions with TF masks by constraining the mask values or the model parameter updates, and 2) using the predicted regions in place of TF masks. For single-channel SE, we consider using the region targets to improve TF mask targets. Furthermore, we propose to use UNet++ for TF mask estimation. Our experimental results on speech recognition (CHiME-3) and SE (CHiME-3 and LibriSpeech) demonstrate the effectiveness of the proposed approach of modeling speech region structure to improve TF masks for speech recognition and enhancement.
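As an informal illustration of the idea of constraining TF mask values with mutually exclusive speech regions, the toy sketch below may help. It is not the authors' method: the abstract derives regions from complex Gaussian mixture models and predicts them with UNet++, whereas this sketch assumes a simple energy threshold on the clean power spectrogram as a stand-in. The function names (`ideal_ratio_mask`, `speech_regions`, `constrain_mask`), the threshold value, and the toy data are all hypothetical.

```python
# Toy sketch only: an energy threshold stands in for the paper's
# GMM-derived regions; all names and values here are hypothetical.

def ideal_ratio_mask(clean_pow, noise_pow, eps=1e-12):
    """Per-bin ratio mask S / (S + N) on power spectrograms (rows = frequency bins)."""
    return [[s / (s + n + eps) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(clean_pow, noise_pow)]

def speech_regions(clean_pow, threshold):
    """Label each TF bin 1 (speech) or 0 (non-speech); each bin gets exactly one label."""
    return [[1 if s > threshold else 0 for s in row] for row in clean_pow]

def constrain_mask(mask, regions):
    """Keep mask values inside the speech region and zero them elsewhere."""
    return [[m if r == 1 else 0.0 for m, r in zip(m_row, r_row)]
            for m_row, r_row in zip(mask, regions)]

# Toy 2x2 power spectrograms (2 frequency bins, 2 frames).
clean_pow = [[4.0, 0.0], [1.0, 0.01]]
noise_pow = [[1.0, 1.0], [1.0, 1.0]]

mask = ideal_ratio_mask(clean_pow, noise_pow)
regions = speech_regions(clean_pow, threshold=0.5)
constrained = constrain_mask(mask, regions)
```

In this toy setup the bin with clean power 0.01 receives a small nonzero ratio-mask value, and the region constraint forces it to zero, which is one simple way a region label could stabilize a mask target.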
Pages: 2705 - 2715
Number of pages: 11
Related papers
50 in total
  • [1] Speech Enhancement Integrating the MVDR Beamforming and T-F Masking
    Zhu, Jinru
    Bao, Changchun
    Cheng, Rui
    CONFERENCE PROCEEDINGS OF 2019 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (IEEE ICSPCC 2019), 2019,
  • [2] A Joint Learning Algorithm for Complex-Valued T-F Masks in Deep Learning-Based Single-Channel Speech Enhancement Systems
    Lee, Jinkyu
    Kang, Hong-Goo
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (06) : 1098 - 1109
  • [3] Learning Speech Structure to Improve Time-Frequency Masks
    Bu, Suliang
    Zhao, Yunxin
    Wang, Shaojun
    Han, Mei
    INTERSPEECH 2021, 2021, : 2731 - 2735
  • [4] Speech enhancement with Gamma speech modeling
    Zou, Xia
    Chen, Liang
    Zhang, Xiong-Wei
    Tongxin Xuebao/Journal on Communications, 2006, 27 (10) : 118 - 123
  • [5] Modeling auditory perception to improve robust speech recognition
    Strope, B
    Alwan, A
    THIRTY-FIRST ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, VOLS 1 AND 2, 1998, : 1056 - 1060
  • [6] β-Masking MMSE Speech Enhancement for Speech Recognition
    You, Chang Huai
    Ma, Bin
    2017 IEEE 2ND INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING (ICSIP), 2017, : 341 - 345
  • [7] NETWORKS FOR SPEECH ENHANCEMENT AND AUTOMATIC SPEECH RECOGNITION
    Vu, Thanh T.
    Bigot, Benjamin
    Chng, Eng Siong
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 499 - 503
  • [8] SPEECH ENHANCEMENT FOR TELEPHONY NAME SPEECH RECOGNITION
    You, Chang Huai
    Rahardja, Susanto
    Li, Haizhou
    2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, 2008, : 973 - 976
  • [9] Noisy speech recognition based on speech enhancement
    Wang, Xia
    Tang, Hongmei
    Zhao, Xiaoqun
    SNPD 2007: EIGHTH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING, AND PARALLEL/DISTRIBUTED COMPUTING, VOL 3, PROCEEDINGS, 2007, : 713 - +
  • [10] MODIFICATION ON LSA SPEECH ENHANCEMENT FOR SPEECH RECOGNITION
    You, Chang Huai
    Ma, Bin
    Ni, Chongjia
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5475 - 5479