A UNIFIED DEEP MODELING APPROACH TO SIMULTANEOUS SPEECH DEREVERBERATION AND RECOGNITION FOR THE REVERB CHALLENGE

Cited by: 0
Authors
Wu, Bo [1]
Li, Kehuang [2]
Huang, Zhen [2]
Siniscalchi, Sabato Marco [2, 3]
Yang, Minglei [1]
Lee, Chin-Hui [2]
Affiliations
[1] Xidian Univ, Natl Lab Radar Signal Proc, Xian, Shaanxi, Peoples R China
[2] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[3] Univ Enna Kore, I-94100 Enna, Italy
Keywords
Signal space robustness; deep modeling; reverberant speech enhancement; robust speech recognition; SUPPRESSION
DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Subject classification codes
0808; 0809
Abstract
We propose a unified deep neural network (DNN) approach to achieve both high-quality enhanced speech and high-accuracy automatic speech recognition (ASR) simultaneously on the recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge. These two goals are accomplished by two proposed techniques: DNN-based regression to enhance reverberant and noisy speech, followed by DNN-based multi-condition training that takes clean-condition, multi-condition, and enhanced speech all into consideration. We first report objective measures of enhanced speech that are superior to those listed in the 2014 REVERB Challenge Workshop. We then show that, in clean-condition training, we obtain the best word error rate (WER) of 13.28% on the 1-channel REVERB simulated evaluation data with the proposed DNN-based pre-processing scheme. Similarly, we attain a competitive single-system WER of 8.75% with the proposed multi-condition training strategy and the same less-discriminative log power spectrum features used in the enhancement stage. Finally, by leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a state-of-the-art WER of 4.46% is attained with a single ASR system and single-channel information. Another state-of-the-art WER of 4.10% is achieved through system combination.
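The abstract reports its recognition results as word error rate (WER). As a reading aid, the following is a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words). It is not the scoring tool used in the REVERB Challenge or by the authors; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Illustrative helper only (hypothetical name); tokenization is a plain
    whitespace split, which real scoring tools usually refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 4.46% corresponds to roughly 4 to 5 word errors per 100 reference words.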
Pages: 36-40
Page count: 5