Vectorized Beam Search for CTC-Attention-based Speech Recognition

Cited by: 19
Authors
Seki, Hiroshi [1 ]
Hori, Takaaki [2 ]
Watanabe, Shinji [3 ]
Moritz, Niko [2 ]
Le Roux, Jonathan [2 ]
Affiliations
[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan
[2] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
Keywords
speech recognition; beam search; parallel computing; encoder-decoder network; GPU;
DOI
10.21437/Interspeech.2019-2860
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104; 100213;
Abstract
This paper investigates efficient beam search techniques for end-to-end automatic speech recognition (ASR) with an attention-based encoder-decoder architecture. We accelerate decoding by vectorizing multiple hypotheses during the beam search, replacing the per-hypothesis score accumulation steps with vector-matrix operations over the vectorized hypotheses. This modification lets us exploit the parallel computing capabilities of multi-core CPUs and GPUs, yielding significant speedups and also allowing multiple utterances to be processed simultaneously in a batch. Moreover, we extend the decoding method to incorporate recurrent neural network language model (RNNLM) and connectionist temporal classification (CTC) scores, which typically improve ASR accuracy but had not been investigated in such parallelized decoding algorithms. Experiments on the LibriSpeech and Corpus of Spontaneous Japanese datasets demonstrate that the vectorized beam search achieves a 1.8x speedup on a CPU and a 33x speedup on a GPU compared with the original CPU implementation. When using joint CTC/attention decoding with an RNNLM, we also achieve an 11x speedup on a GPU while maintaining the benefits of CTC and RNNLM. With these benefits, we achieve almost real-time processing with a small latency of 0.1x real time, without a streaming search process.
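The core idea in the abstract, replacing per-hypothesis score accumulation with a single array operation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `loop_beam_step` and `vectorized_beam_step`, the array shapes, and the toy inputs are all assumptions made for illustration, and the sketch covers only one expansion step with attention-style log-probabilities (no CTC or RNNLM scores, no end-of-sequence handling).

```python
import numpy as np


def loop_beam_step(prev_scores, log_probs, beam):
    """Reference version: accumulate scores one hypothesis at a time."""
    candidates = []
    for hyp, prev in enumerate(prev_scores):
        for token, lp in enumerate(log_probs[hyp]):
            candidates.append((prev + lp, hyp, token))
    # Keep the `beam` best (score, parent hypothesis, token) triples.
    candidates.sort(key=lambda c: -c[0])
    return candidates[:beam]


def vectorized_beam_step(prev_scores, log_probs, beam):
    """Vectorized version: one broadcast add and one top-k over all candidates.

    prev_scores: (num_hyps,) accumulated log-scores of the current hypotheses.
    log_probs:   (num_hyps, vocab) next-token log-probabilities (hypothetical
                 decoder output for this step).
    Returns (scores, parent_hyp_indices, token_indices), best first.
    """
    vocab = log_probs.shape[1]
    total = prev_scores[:, None] + log_probs        # (num_hyps, vocab)
    flat = total.ravel()
    top = np.argpartition(-flat, beam - 1)[:beam]   # unordered top-`beam` indices
    top = top[np.argsort(-flat[top])]               # order them by descending score
    return flat[top], top // vocab, top % vocab


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev = rng.normal(size=3)                       # 3 active hypotheses (toy values)
    step_log_probs = rng.normal(size=(3, 5))        # toy scores over a 5-token vocabulary
    scores, parents, tokens = vectorized_beam_step(prev, step_log_probs, beam=4)
    print(scores, parents, tokens)
```

Both functions select the same candidates when scores are distinct; the vectorized form simply moves the double loop into array operations, which is what allows the work to be dispatched to multi-core CPU or GPU kernels.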
Pages: 3825-3829
Number of pages: 5
Related papers
50 in total
  • [1] A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
    Chu-Xiong Qin
    Wen-Lin Zhang
    Dan Qu
    EURASIP Journal on Audio, Speech, and Music Processing, 2019
  • [2] A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
    Qin, Chu-Xiong
    Zhang, Wen-Lin
    Qu, Dan
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2019, 2019 (01)
  • [3] INTERDECODER: USING ATTENTION DECODERS AS INTERMEDIATE REGULARIZATION FOR CTC-BASED SPEECH RECOGNITION
    Komatsu, Tatsuya
    Fujita, Yusuke
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 46 - 51
  • [4] Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias
    Zhou, Wei
    Schluter, Ralf
    Ney, Hermann
    INTERSPEECH 2020, 2020, : 1768 - 1772
  • [5] ATTENTION-BASED GATED SCALING ADAPTIVE ACOUSTIC MODEL FOR CTC-BASED SPEECH RECOGNITION
    Ding, Fenglin
    Guo, Wu
    Dai, Lirong
    Du, Jun
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7404 - 7408
  • [6] AUDIO-VISUAL SPEECH RECOGNITION WITH A HYBRID CTC/ATTENTION ARCHITECTURE
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 513 - 520
  • [7] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
    Watanabe, Shinji
    Hori, Takaaki
    Kim, Suyoun
    Hershey, John R.
    Hayashi, Tomoki
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1240 - 1253
  • [8] Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
    Ren, Zeyu
    Yolwas, Nurmemet
    Slamu, Wushour
    Cao, Ronghe
    Wang, Huiru
    SENSORS, 2022, 22 (19)
  • [9] Joint CTC/attention decoding for end-to-end speech recognition
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 518 - 529
  • [10] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINT CTC-ATTENTION BASED MODELS
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 936 - 943