Two-stage deep spectrum fusion for noise-robust end-to-end speech recognition

Cited by: 5
Authors
Fan, Cunhang [1 ]
Ding, Mingming [1 ]
Yi, Jiangyan [2 ]
Li, Jinpeng [3 ]
Lv, Zhao [1 ]
Affiliations
[1] Anhui Univ, Sch Comp Sci & Technol, Anhui Prov Key Lab Multimodal Cognit Computat, Hefei, Peoples R China
[2] Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Ningbo Inst Life & Hlth Ind, Ningbo, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Robust end-to-end ASR; Speech enhancement; Masking and mapping; Speech distortion; Deep spectrum fusion; ENHANCEMENT; NETWORKS; DEREVERBERATION;
DOI
10.1016/j.apacoust.2023.109547
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Recently, speech enhancement (SE) methods have achieved strong performance. However, because of the speech distortion problem, enhanced speech may lose important information, which degrades the performance of automatic speech recognition (ASR). To address this problem, this paper proposes a two-stage deep spectrum fusion with a joint training framework for noise-robust end-to-end (E2E) ASR. It consists of a masking and mapping fusion (MMF) and a gated recurrent fusion (GRF). The MMF serves as the first stage and focuses on SE, exploiting the complementarity of masking-based and mapping-based enhancement methods to alleviate the problem of speech distortion. The GRF serves as the second stage and aims to further recover the lost information by fusing the enhanced speech from the MMF with the original input. We conduct extensive experiments on the open Mandarin speech corpus AISHELL-1 with two noise datasets, 100 Nonspeech and NOISEX-92. Experimental results indicate that our proposed method significantly improves performance, relatively reducing the character error rate (CER) by 17.36% compared with the conventional joint training method.
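The gated fusion described in the abstract can be illustrated with a minimal sketch. This assumes the familiar sigmoid-gate formulation (a learned gate interpolating per feature between the enhanced spectrum and the original noisy input); the function names, shapes, and random parameters below are hypothetical, not the authors' actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(enhanced, original, W, b):
    """Hypothetical gated fusion: a sigmoid gate computed from the
    concatenated streams decides, per feature, how much of the
    enhanced spectrum versus the original input to keep."""
    g = sigmoid(np.concatenate([enhanced, original], axis=-1) @ W + b)
    # Convex combination: g weights the enhanced stream,
    # (1 - g) weights the original input.
    return g * enhanced + (1.0 - g) * original

# Toy example: 4-dimensional spectral frames, random gate parameters.
rng = np.random.default_rng(0)
enh = rng.standard_normal(4)     # output of the enhancement stage
orig = rng.standard_normal(4)    # original (noisy) input features
W = rng.standard_normal((8, 4))
b = np.zeros(4)
fused = gated_fusion(enh, orig, W, b)
print(fused.shape)  # (4,)
```

Because the gate lies in (0, 1), each fused feature is an elementwise interpolation between the two streams, which is how information suppressed by the enhancement stage can be recovered from the original input.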
Pages: 10
Related Papers
50 records
  • [41] END-TO-END ANCHORED SPEECH RECOGNITION
    Wang, Yiming
    Fan, Xing
    Chen, I-Fan
    Liu, Yuzong
    Chen, Tongfei
    Hoffmeister, Bjorn
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7090 - 7094
  • [42] Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion
    Le, Duc
    Jain, Mahaveer
    Keren, Gil
    Kim, Suyoun
    Shi, Yangyang
    Mahadeokar, Jay
    Chan, Julian
    Shangguan, Yuan
    Fuegen, Christian
    Kalinli, Ozlem
    Saraf, Yatharth
    Seltzer, Michael L.
    INTERSPEECH 2021, 2021, : 1772 - 1776
  • [43] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
    Zhang, Ying
    Pezeshki, Mohammad
    Brakel, Philemon
    Zhang, Saizheng
    Laurent, Cesar
    Bengio, Yoshua
    Courville, Aaron
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 410 - 414
  • [44] END-TO-END SPEECH EMOTION RECOGNITION USING DEEP NEURAL NETWORKS
    Tzirakis, Panagiotis
    Zhang, Jiehao
    Schuller, Bjoern W.
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5089 - 5093
  • [45] A Two-Stage End-to-End Deep Learning Framework for Pathologic Examination in Skin Tumor Diagnosis
    Shi, Zhijie
    Zhu, Jingyi
    Yu, Liheng
    Li, Xiaopeng
    Li, Jiaxin
    Chen, Huyan
    Chen, Lianjun
    AMERICAN JOURNAL OF PATHOLOGY, 2023, 193 (06): : 769 - 777
  • [46] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [47] Sparse coding of the modulation spectrum for noise-robust automatic speech recognition
    Sara Ahmadi
    Seyed Mohammad Ahadi
    Bert Cranen
    Lou Boves
    EURASIP Journal on Audio, Speech, and Music Processing, 2014
  • [48] Sparse coding of the modulation spectrum for noise-robust automatic speech recognition
    Ahmadi, Sara
    Ahadi, Seyed Mohammad
    Cranen, Bert
    Boves, Lou
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2014, : 1 - 20
  • [49] Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
    Kim, Suyoun
    Dalmia, Siddharth
    Metze, Florian
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 1131 - 1141
  • [50] END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM
    Kim, Chanwoo
    Kim, Sungsoo
    Kim, Kwangyoun
    Kumar, Mehul
    Kim, Jiyeon
    Lee, Kyungmin
    Han, Changwoo
    Garg, Abhinav
    Kim, Eunhyang
    Shin, Minkyoo
    Singh, Shatrughan
    Heck, Larry
    Gowda, Dhananjaya
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 562 - 569