Vectorized Beam Search for CTC-Attention-based Speech Recognition

Cited by: 19

Authors:

Seki, Hiroshi [1]

Hori, Takaaki [2]

Watanabe, Shinji [3]

Moritz, Niko [2]

Le Roux, Jonathan [2]

Affiliations:

[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan

[2] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA

[3] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

speech recognition; beam search; parallel computing; encoder-decoder network; GPU

DOI:

10.21437/Interspeech.2019-2860

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

This paper investigates efficient beam search techniques for end-to-end automatic speech recognition (ASR) with an attention-based encoder-decoder architecture. We accelerate the decoding process by vectorizing multiple hypotheses during the beam search, where we replace the score accumulation steps for each hypothesis with vector-matrix operations for the vectorized hypotheses. This modification allows us to take advantage of the parallel computing capabilities of multi-core CPUs and GPUs, resulting in significant speedups and also enabling us to process multiple utterances in a batch simultaneously. Moreover, we extend the decoding method to incorporate a recurrent neural network language model (RNNLM) and connectionist temporal classification (CTC) scores, which typically improve ASR accuracy but have not been investigated for the use of such parallelized decoding algorithms. Experiments with LibriSpeech and Corpus of Spontaneous Japanese datasets have demonstrated that the vectorized beam search achieves 1.8x speedup on a CPU and 33x speedup on a GPU compared with the original CPU implementation. When using joint CTC/attention decoding with an RNNLM, we also achieved 11x speedup on a GPU while maintaining the benefits of CTC and RNNLM. With these benefits, we achieved almost real-time processing with a small latency of 0.1x real-time without a streaming search process.
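The core idea in the abstract, replacing per-hypothesis score accumulation with a single array operation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `combine_scores` and `vectorized_beam_step`, the weights `ctc_weight` and `lm_weight`, and the particular linear interpolation of attention, CTC, and language-model log-probabilities are assumptions made for illustration only.

```python
import numpy as np


def combine_scores(att_logp, ctc_logp, lm_logp, ctc_weight=0.3, lm_weight=0.5):
    """Interpolate per-token log-probabilities from three scorers.

    All inputs have shape (beam, vocab). The weighting scheme is an
    assumption for illustration, not taken from the record above.
    """
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp + lm_weight * lm_logp


def vectorized_beam_step(prev_scores, log_probs, beam_size):
    """Extend all hypotheses in one step, without a Python loop over them.

    prev_scores: (beam,) accumulated log-scores of the current hypotheses.
    log_probs:   (beam, vocab) next-token log-probabilities per hypothesis.
    Returns (new_scores, parent_idx, token_idx), sorted by descending score.
    """
    vocab = log_probs.shape[1]
    # One broadcast addition replaces the per-hypothesis accumulation loop.
    total = prev_scores[:, None] + log_probs  # shape (beam, vocab)
    flat = total.ravel()
    k = min(beam_size, flat.size)
    # Select the k best (hypothesis, token) pairs, then sort only those k.
    top = np.argpartition(-flat, k - 1)[:k]
    top = top[np.argsort(-flat[top])]
    # Recover which hypothesis each winner extends and which token it adds.
    return flat[top], top // vocab, top % vocab


if __name__ == "__main__":
    prev = np.array([0.0, -1.0])
    logp = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.1, 0.8]]))
    scores, parents, tokens = vectorized_beam_step(prev, logp, beam_size=2)
    print(scores, parents, tokens)
```

Because the candidate scores live in one array, the same step extends to a batch of utterances by adding a leading batch dimension, and the array operations can be moved to a GPU with a compatible array library.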

Pages: 3825-3829

Page count: 5

