A UNIFIED DEEP MODELING APPROACH TO SIMULTANEOUS SPEECH DEREVERBERATION AND RECOGNITION FOR THE REVERB CHALLENGE

Cited by: 0
Authors
Wu, Bo [1]
Li, Kehuang [2]
Huang, Zhen [2]
Siniscalchi, Sabato Marco [2,3]
Yang, Minglei [1]
Lee, Chin-Hui [2]
Affiliations
[1] Xidian Univ, Natl Lab Radar Signal Proc, Xian, Shaanxi, Peoples R China
[2] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
[3] Univ Enna Kore, I-94100 Enna, Italy
Keywords
Signal space robustness; deep modeling; reverberant speech enhancement; robust speech recognition; suppression
DOI
Not available
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
We propose a unified deep neural network (DNN) approach that achieves both high-quality enhanced speech and high-accuracy automatic speech recognition (ASR) simultaneously on the recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge. These two goals are accomplished by two proposed techniques: DNN-based regression to enhance reverberant and noisy speech, followed by DNN-based multi-condition training that takes clean-condition, multi-condition, and enhanced speech all into consideration. We first report objective measures of the enhanced speech superior to those listed in the 2014 REVERB Challenge Workshop. We then show that, with clean-condition training, we obtain the best word error rate (WER) of 13.28% on the 1-channel REVERB simulated evaluation data using the proposed DNN-based pre-processing scheme. Similarly, we attain a competitive single-system WER of 8.75% with the proposed multi-condition training strategy and the same less-discriminative log power spectrum features used in the enhancement stage. Finally, by leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a state-of-the-art WER of 4.46% is attained with a single ASR system and single-channel information. Another state-of-the-art WER of 4.10% is achieved through system combination.
Pages: 36-40
Number of pages: 5
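The abstract describes a two-stage pipeline whose front end is a DNN-based regression model that maps log power spectrum (LPS) features of reverberant, noisy speech to those of parallel clean speech. Below is a minimal sketch (not the authors' code) of such a regression front end in PyTorch; the feature dimension, context window, layer sizes, and optimizer settings are illustrative assumptions rather than values reported in the paper.

# Minimal sketch of a DNN-regression dereverberation front end: a feed-forward
# network mapping a context window of reverberant/noisy LPS frames to the
# corresponding clean LPS frame, trained with an MSE objective.
import torch
import torch.nn as nn

N_BINS = 257        # LPS bins per frame (e.g. 512-point FFT); assumed value
CONTEXT = 5         # +/- 5 frames of acoustic context; assumed value
IN_DIM = N_BINS * (2 * CONTEXT + 1)

class DereverbDNN(nn.Module):
    """Feed-forward regression DNN: reverberant LPS context -> clean LPS frame."""
    def __init__(self, hidden=2048, layers=3):
        super().__init__()
        dims = [IN_DIM] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        blocks.append(nn.Linear(dims[-1], N_BINS))   # linear output for regression
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x)

def splice(lps, context=CONTEXT):
    """Stack each frame with its +/- context neighbours (edge frames replicated)."""
    frames, _ = lps.shape
    padded = torch.cat([lps[:1].repeat(context, 1), lps, lps[-1:].repeat(context, 1)])
    return torch.cat([padded[i:i + frames] for i in range(2 * context + 1)], dim=1)

# One illustrative training step on random stand-in data.
model = DereverbDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
reverb_lps = torch.randn(100, N_BINS)   # stand-in for reverberant-speech LPS frames
clean_lps = torch.randn(100, N_BINS)    # stand-in for parallel clean-speech LPS frames
pred = model(splice(reverb_lps))
loss = nn.functional.mse_loss(pred, clean_lps)
opt.zero_grad(); loss.backward(); opt.step()

In the paper's pipeline, the enhanced LPS frames produced by such a front end would then feed the multi-condition-trained ASR back end; this sketch covers only the enhancement stage.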