Vectorized Beam Search for CTC-Attention-based Speech Recognition

Cited by: 19
Authors
Seki, Hiroshi [1]
Hori, Takaaki [2]
Watanabe, Shinji [3]
Moritz, Niko [2]
Le Roux, Jonathan [2]
Affiliations
[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan
[2] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
Keywords
speech recognition; beam search; parallel computing; encoder-decoder network; GPU;
DOI
10.21437/Interspeech.2019-2860
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
This paper investigates efficient beam search techniques for end-to-end automatic speech recognition (ASR) with an attention-based encoder-decoder architecture. We accelerate the decoding process by vectorizing multiple hypotheses during the beam search, replacing the score accumulation steps for each hypothesis with vector-matrix operations over the vectorized hypotheses. This modification allows us to take advantage of the parallel computing capabilities of multi-core CPUs and GPUs, resulting in significant speedups and also enabling us to process multiple utterances in a batch simultaneously. Moreover, we extend the decoding method to incorporate recurrent neural network language model (RNNLM) and connectionist temporal classification (CTC) scores, which typically improve ASR accuracy but have not been investigated for use in such parallelized decoding algorithms. Experiments with the LibriSpeech and Corpus of Spontaneous Japanese datasets demonstrated that the vectorized beam search achieves a 1.8x speedup on a CPU and a 33x speedup on a GPU compared with the original CPU implementation. When using joint CTC/attention decoding with an RNNLM, we also achieved an 11x speedup on a GPU while maintaining the benefits of CTC and RNNLM. With these benefits, we achieved almost real-time processing with a small latency of 0.1x real-time without a streaming search process.
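The core idea in the abstract, replacing per-hypothesis score accumulation with a single array operation over all hypotheses, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `beam_step_loop` and `beam_step_vectorized`, the toy scores, and the choice of NumPy are all assumptions made for illustration, and the sketch covers only one expansion step with attention-style log-probabilities (no CTC or RNNLM scores, no batching over utterances).

```python
import numpy as np


def beam_step_loop(scores, log_probs, beam):
    """Reference version: accumulate scores one hypothesis at a time."""
    candidates = []
    for h in range(len(scores)):
        for v in range(log_probs.shape[1]):
            candidates.append((scores[h] + log_probs[h, v], h, v))
    candidates.sort(key=lambda c: -c[0])  # best score first
    return candidates[:beam]


def beam_step_vectorized(scores, log_probs, beam):
    """Same step using one broadcast add and one top-k over all hypotheses."""
    total = scores[:, None] + log_probs            # shape (n_hyps, vocab)
    flat = total.ravel()
    top = np.argpartition(-flat, beam - 1)[:beam]  # unordered top-k indices
    top = top[np.argsort(-flat[top])]              # order them best first
    parent, token = np.unravel_index(top, total.shape)
    return flat[top], parent, token


# Toy input (hypothetical): 3 live hypotheses, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
scores = np.array([-1.0, -2.5, -0.3])              # accumulated log-scores
logits = rng.normal(size=(3, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

new_scores, parent, token = beam_step_vectorized(scores, log_probs, beam=3)
```

Here `parent[i]` identifies which existing hypothesis the i-th surviving candidate extends and `token[i]` is the appended token. Both functions select the same candidates; the vectorized one simply hands the arithmetic to array operations, which is the property that lets such a step run on parallel hardware.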
Pages: 3825 - 3829
Number of pages: 5
Related papers
50 in total
  • [31] Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture
    Miao, Haoran
    Cheng, Gaofeng
    Zhang, Pengyuan
    Yan, Yonghong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1452 - 1465
  • [32] Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes
    Kuerzinger, Ludwig
    Watzel, Tobias
    Li, Lujun
    Baumgartner, Robert
    Rigoll, Gerhard
    SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 258 - 269
  • [33] PERSONALIZATION OF CTC SPEECH RECOGNITION MODELS
    Dingliwal, Saket
    Sunkara, Monica
    Ronanki, Srikanth
    Farris, Jeff
    Kirchhoff, Katrin
    Bodapati, Sravan
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 302 - 309
  • [34] SEARCH ERROR RISK MINIMIZATION IN VITERBI BEAM SEARCH FOR SPEECH RECOGNITION
    Hori, Takaaki
    Watanabe, Shinji
    Nakamura, Atsushi
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4934 - 4937
  • [35] JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING
    Kim, Suyoun
    Hori, Takaaki
    Watanabe, Shinji
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 4835 - 4839
  • [36] Hybrid CTC-Attention Network-Based End-to-End Speech Recognition System for Korean Language
    Park, Hosung
    Kim, Changmin
    Son, Hyunsoo
    Seo, Soonshin
    Kim, Ji-Hwan
    JOURNAL OF WEB ENGINEERING, 2022, 21 (02): : 265 - 284
  • [37] GAUSSIAN KERNELIZED SELF-ATTENTION FOR LONG SEQUENCE DATA AND ITS APPLICATION TO CTC-BASED SPEECH RECOGNITION
    Kashiwagi, Yosuke
    Tsunoo, Emiru
    Watanabe, Shinji
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6214 - 6218
  • [38] Research on Tibetan Speech Recognition Based on CNN-DFSMN-CTC
    Northwest Normal University, Engineering Research Center of Gansu Province for Intelligent Information Technology and Application, College of Physics and Electronic Engineering, LanZhou, China
    Proc. - Asia Conf. Electr. Power Comput. Eng., EPCE, : 215 - 219
  • [39] Investigating Joint CTC-Attention Models for End-to-End Russian Speech Recognition
    Markovnikov, Nikita
    Kipyatkova, Irina
    SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 337 - 347
  • [40] Enhancing CTC-based speech recognition with diverse modeling units
    Han, Shiyi
    Lei, Zhihong
    Xu, Mingbin
    Na, Xingyu
    Huang, Zhen
    INTERSPEECH 2024, 2024, : 4583 - 4587