HAVE BEST OF BOTH WORLDS: TWO-PASS HYBRID AND E2E CASCADING FRAMEWORK FOR SPEECH RECOGNITION

Cited by: 2
Authors
Ye, Guoli [1 ]
Mazalov, Vadim [1 ]
Li, Jinyu [1 ]
Gong, Yifan [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Keywords
two-pass; hybrid; end-to-end; cascaded; combination
DOI
10.1109/ICASSP43922.2022.9747144
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Hybrid and end-to-end (E2E) systems have complementary advantages and exhibit different error patterns in their speech recognition results. By jointly modeling audio and text, the E2E model performs better in matched scenarios and scales well with large amounts of paired audio-text training data. The modularized hybrid model is easier to customize and better able to exploit massive amounts of unpaired text data. This paper proposes a two-pass hybrid and E2E cascading (HEC) framework that combines the hybrid and E2E models to take advantage of both, with the hybrid model in the first pass and the E2E model in the second pass. We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system. More importantly, compared with a pure E2E system, we show that the proposed system has the potential to retain the advantages of the hybrid system, e.g., customization and segmentation capabilities. We also show that the second-pass E2E model in HEC is robust to changes in the first-pass hybrid model.
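The abstract describes the cascade only at a high level; as a reading aid, the following is a minimal Python sketch of the two-pass decode flow, assuming hypothetical HybridASR and E2EASR interfaces (this record does not specify the models' actual APIs, only the two-pass ordering and that segmentation comes from the hybrid side).

# A minimal sketch of the HEC decode flow; HybridASR, E2EASR, Segment
# and hec_decode are hypothetical names, not from the paper.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float          # segment start time (seconds), from first pass
    end: float            # segment end time (seconds), from first pass
    first_pass_text: str  # initial hypothesis from the hybrid model

class HybridASR:
    """Hypothetical first-pass hybrid system: segments the audio and
    emits an initial hypothesis per segment."""
    def recognize(self, audio: bytes) -> List[Segment]:
        raise NotImplementedError

class E2EASR:
    """Hypothetical second-pass E2E model: re-decodes a segment's audio
    to produce the final recognition result."""
    def recognize(self, audio: bytes, start: float, end: float) -> str:
        raise NotImplementedError

def hec_decode(audio: bytes, hybrid: HybridASR, e2e: E2EASR) -> List[str]:
    # First pass: the hybrid model handles segmentation and initial
    # recognition (this is where its customization strengths apply).
    segments = hybrid.recognize(audio)
    # Second pass: the E2E model re-decodes each segment and produces
    # the final output.
    return [e2e.recognize(audio, seg.start, seg.end) for seg in segments]

Whether the second pass re-decodes the raw audio, rescores first-pass hypotheses, or conditions on both is not stated in this record; the sketch assumes simple per-segment re-decoding.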
Pages: 7432-7436
Page count: 5