Vectorized Beam Search for CTC-Attention-based Speech Recognition

Cited by: 19
Authors
Seki, Hiroshi [1 ]
Hori, Takaaki [2 ]
Watanabe, Shinji [3 ]
Moritz, Niko [2 ]
Le Roux, Jonathan [2 ]
Affiliations
[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan
[2] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
INTERSPEECH 2019
Keywords
speech recognition; beam search; parallel computing; encoder-decoder network; GPU
DOI
10.21437/Interspeech.2019-2860
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
This paper investigates efficient beam search techniques for end-to-end automatic speech recognition (ASR) with an attention-based encoder-decoder architecture. We accelerate the decoding process by vectorizing multiple hypotheses during the beam search, replacing the score accumulation steps performed for each hypothesis with vector-matrix operations over the vectorized hypotheses. This modification allows us to exploit the parallel computing capabilities of multi-core CPUs and GPUs, resulting in significant speedups, and also enables us to process multiple utterances in a batch simultaneously. Moreover, we extend the decoding method to incorporate a recurrent neural network language model (RNNLM) and connectionist temporal classification (CTC) scores, which typically improve ASR accuracy but had not previously been studied in combination with such parallelized decoding algorithms. Experiments on the LibriSpeech and Corpus of Spontaneous Japanese datasets demonstrate that the vectorized beam search achieves a 1.8x speedup on a CPU and a 33x speedup on a GPU compared with the original CPU implementation. When using joint CTC/attention decoding with an RNNLM, we also achieve an 11x speedup on a GPU while retaining the accuracy benefits of CTC and the RNNLM. With these gains, we achieve almost real-time processing, with a latency of 0.1x real-time, without a streaming search process.
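To make the vectorization concrete, below is a minimal NumPy sketch of a single beam-expansion step in the spirit of the abstract, not the authors' ESPnet implementation. The per-hypothesis score-accumulation loop is replaced by one broadcast addition over a (beam, vocab) score matrix, followed by a single top-k selection over all candidates; the function name, tensor shapes, and the interpolation weights lam and beta are illustrative assumptions.

import numpy as np

def vectorized_beam_step(att_logp, ctc_logp, lm_logp, hyp_scores,
                         beam_size, lam=0.3, beta=0.5):
    # att_logp, ctc_logp, lm_logp: (beam, vocab) log-probabilities from the
    # attention decoder, CTC prefix scorer, and RNNLM for every hypothesis.
    # hyp_scores: (beam,) accumulated hypothesis scores.
    # Score every (hypothesis, token) expansion in one shot: a weighted sum
    # of the three models, broadcast-added to the accumulated scores.
    joint = (1.0 - lam) * att_logp + lam * ctc_logp + beta * lm_logp
    scores = hyp_scores[:, None] + joint                # (beam, vocab)
    flat = scores.ravel()
    # One top-k over all beam*vocab candidates replaces the per-hypothesis loop.
    top = np.argpartition(-flat, beam_size - 1)[:beam_size]
    top = top[np.argsort(-flat[top])]                   # order the k survivors
    prev_ids, token_ids = np.unravel_index(top, scores.shape)
    return prev_ids, token_ids, flat[top]

# Tiny usage example with random log-probabilities.
rng = np.random.default_rng(0)
beam, vocab = 4, 10
att = np.log(rng.dirichlet(np.ones(vocab), size=beam))
ctc = np.log(rng.dirichlet(np.ones(vocab), size=beam))
lm = np.log(rng.dirichlet(np.ones(vocab), size=beam))
prev_ids, token_ids, new_scores = vectorized_beam_step(att, ctc, lm,
                                                       np.zeros(beam), beam)
print(prev_ids, token_ids, new_scores)

On a GPU, the same broadcast-and-top-k structure maps directly onto batched tensor kernels, and batch decoding of multiple utterances simply adds one more leading dimension to every array, which is the mechanism behind the speedups reported above.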
Pages: 3825-3829
Number of pages: 5
Related papers
50 records in total
  • [21] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    APPLIED SCIENCES-BASEL, 2019, 9 (21)
  • [22] Attention based end to end Speech Recognition for Voice Search in Hindi and English
    Joshi, Raviraj
    Kannan, Venkateshan
    FIRE 2021: PROCEEDINGS OF THE 13TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION, 2021: 107-113
  • [23] Attention-Based End-to-End Speech Recognition on Voice Search
    Shan, Changhao
    Zhang, Junbo
    Wang, Yujun
    Xie, Lei
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 4764-4768
  • [24] Offline Handwritten Text Recognition Based on CTC-Attention
    Ma, Yangyang
    Xiao, Bing
    LASER & OPTOELECTRONICS PROGRESS, 2021, 58 (12)
  • [25] Intermediate Loss Regularization for CTC-Based Speech Recognition
    Lee, Jaesong
    Watanabe, Shinji
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 6224-6228
  • [26] Mandarin Electrolaryngeal Speech Recognition Based on WaveNet-CTC
    Qian, Zhaopeng
    Wang, Li
    Zhang, Shaochuan
    Liu, Chan
    Niu, Haijun
    JOURNAL OF SPEECH LANGUAGE AND HEARING RESEARCH, 2019, 62 (07): 2203-2212
  • [27] End-to-end recognition of streaming Japanese speech using CTC and local attention
    Chen, Jiahao
    Nishimura, Ryota
    Kitaoka, Norihide
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2020, 9 (01)
  • [28] Hybrid CTC/Attention End-to-End Chinese Speech Recognition Enhanced by Conformer
    Chen, Ge
    Xie, Xukang
    Sun, Jun
    Chen, Qidong
    COMPUTER ENGINEERING AND APPLICATIONS, 2024, 59 (04): 97-103
  • [29] A Speech Recognition Acoustic Model Based on LSTM-CTC
    Zhang, Yiwen
    Lu, Xuanmin
    2018 IEEE 18TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT), 2018: 1052-1055
  • [30] Adaptive Speaker Normalization for CTC-Based Speech Recognition
    Ding, Fenglin
    Guo, Wu
    Gu, Bin
    Ling, Zhenhua
    Du, Jun
    INTERSPEECH 2020, 2020: 1266-1270