Two-stage deep spectrum fusion for noise-robust end-to-end speech recognition

Cited by: 5
Authors:
Fan, Cunhang [1 ]
Ding, Mingming [1 ]
Yi, Jiangyan [2 ]
Li, Jinpeng [3 ]
Lv, Zhao [1 ]
Affiliations:
[1] Anhui Univ, Sch Comp Sci & Technol, Anhui Prov Key Lab Multimodal Cognit Computat, Hefei, Peoples R China
[2] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Ningbo Inst Life & Hlth Ind, Ningbo, Peoples R China
Funding:
National Natural Science Foundation of China
Keywords:
Robust end-to-end ASR; Speech enhancement; Masking and mapping; Speech distortion; Deep spectrum fusion; ENHANCEMENT; NETWORKS; DEREVERBERATION;
DOI:
10.1016/j.apacoust.2023.109547
CLC number:
O42 [Acoustics]
Discipline codes:
070206; 082403
Abstract:
Recently, speech enhancement (SE) methods have achieved strong performance. However, because of the speech distortion problem, enhanced speech may lose significant information, which degrades the performance of automatic speech recognition (ASR). To address this problem, this paper proposes a two-stage deep spectrum fusion with a joint training framework for noise-robust end-to-end (E2E) ASR. It consists of a masking and mapping fusion (MMF) and a gated recurrent fusion (GRF). The MMF serves as the first stage and focuses on SE: it exploits the complementarity of masking-based and mapping-based enhancement methods to alleviate speech distortion. The GRF serves as the second stage and retrieves the lost information by fusing the enhanced speech of the MMF with the original input. We conduct extensive experiments on the open Mandarin speech corpus AISHELL-1 with two noise datasets, 100 Nonspeech and NOISEX-92. Experimental results indicate that the proposed method significantly improves performance, reducing the character error rate (CER) by a relative 17.36% compared with the conventional joint training method.
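The record does not include the paper's architecture details, but the abstract's two-stage design can be illustrated with a minimal PyTorch sketch. Everything below (module names, layer sizes, and the exact fusion rules) is an assumption for illustration, not the authors' implementation:

```python
# Hypothetical sketch of the two-stage fusion described in the abstract;
# layer sizes, fusion rules, and module names are assumptions, not the
# paper's actual design.
import torch
import torch.nn as nn

class MMF(nn.Module):
    """Stage 1 (assumed form): fuse masking-based and mapping-based SE."""
    def __init__(self, dim: int = 257, hidden: int = 512):
        super().__init__()
        self.encoder = nn.LSTM(dim, hidden, num_layers=2, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())
        self.map_head = nn.Linear(hidden, dim)   # direct spectrum regression
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, noisy):                    # noisy: (B, T, dim) magnitudes
        h, _ = self.encoder(noisy)
        masked = self.mask_head(h) * noisy       # masking-based estimate
        mapped = self.map_head(h)                # mapping-based estimate
        return self.fuse(torch.cat([masked, mapped], dim=-1))

class GRF(nn.Module):
    """Stage 2 (assumed form): gate between enhanced and original spectra."""
    def __init__(self, dim: int = 257):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, enhanced, noisy):
        g = self.gate(torch.cat([enhanced, noisy], dim=-1))
        # The gate re-admits raw-input detail lost during enhancement.
        return g * enhanced + (1.0 - g) * noisy

noisy = torch.randn(4, 100, 257).abs()           # toy batch of magnitude spectra
fused = GRF()(MMF()(noisy), noisy)               # features for the E2E ASR encoder
```

In a joint training setup such as the abstract describes, these fused features would feed the E2E ASR model, and the SE and ASR losses would be optimized together.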
Pages: 10
Related papers:
50 records in total
  • [21] An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition
    Raj, Bhiksha
    Turicchia, Lorenzo
    Schmidt-Nielsen, Bent
    Sarpeshkar, Rahul
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2007
  • [23] Efficient Noise-Robust Speech Recognition Front-End Based on the ETSI Standard
    Neves, Claudio
    Veiga, Arlindo
    Sa, Luis
    Perdigao, Fernando
    ICSP: 2008 9TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, VOLS 1-5, PROCEEDINGS, 2008, : 609 - 612
  • [24] TSE-CNN: A Two-Stage End-to-End CNN for Human Activity Recognition
    Huang, Jiahui
    Lin, Shuisheng
    Wang, Ning
    Dai, Guanghai
    Xie, Yuxiang
    Zhou, Jun
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2020, 24 (01) : 292 - 299
  • [25] VERY DEEP CONVOLUTIONAL NETWORKS FOR END-TO-END SPEECH RECOGNITION
    Zhang, Yu
    Chan, William
    Jaitly, Navdeep
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 4845 - 4849
  • [26] Arabic speech recognition using end-to-end deep learning
    Alsayadi, Hamzah A.
    Abdelhamid, Abdelaziz A.
    Hegazy, Islam
    Fayed, Zaki T.
    IET SIGNAL PROCESSING, 2021, 15 (08) : 521 - 534
  • [27] End-to-End Automatic Speech Recognition with Deep Mutual Learning
    Masumura, Ryo
    Ihori, Mana
    Takashima, Akihiko
    Tanaka, Tomohiro
    Ashihara, Takanori
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 632 - 637
  • [28] Noise-robust speech recognition based on difference of power spectrum
    Xu, JF
    Wei, G
    ELECTRONICS LETTERS, 2000, 36 (14) : 1247 - 1248
  • [29] End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge
    Kimura, Naoki
    Su, Zixiong
    Saeki, Takaaki
    INTERSPEECH 2020, 2020, : 1025 - 1026
  • [30] Adversarial Regularization for Attention Based End-to-End Robust Speech Recognition
    Sun, Sining
    Guo, Pengcheng
    Xie, Lei
    Hwang, Mei-Yuh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (11) : 1826 - 1838