Vectorized Beam Search for CTC-Attention-based Speech Recognition

Cited by: 19

Authors:

Seki, Hiroshi [1]

Hori, Takaaki [2]

Watanabe, Shinji [3]

Moritz, Niko [2]

Le Roux, Jonathan [2]

Affiliations:

[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan

[2] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA

[3] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

speech recognition; beam search; parallel computing; encoder-decoder network; GPU;

DOI:

10.21437/Interspeech.2019-2860

Chinese Library Classification:

R36 [Pathology]; R76 [Otorhinolaryngology];

Subject classification codes:

100104 ; 100213 ;

Abstract:

This paper investigates efficient beam search techniques for end-to-end automatic speech recognition (ASR) with an attention-based encoder-decoder architecture. We accelerate the decoding process by vectorizing multiple hypotheses during the beam search, replacing the score accumulation steps for each hypothesis with vector-matrix operations over the vectorized hypotheses. This modification allows us to take advantage of the parallel computing capabilities of multi-core CPUs and GPUs, resulting in significant speedups and also enabling us to process multiple utterances in a batch simultaneously. Moreover, we extend the decoding method to incorporate recurrent neural network language model (RNNLM) and connectionist temporal classification (CTC) scores, which typically improve ASR accuracy but have not been investigated for use with such parallelized decoding algorithms. Experiments on the LibriSpeech and Corpus of Spontaneous Japanese datasets demonstrate that the vectorized beam search achieves a 1.8x speedup on a CPU and a 33x speedup on a GPU compared with the original CPU implementation. When using joint CTC/attention decoding with an RNNLM, we also achieved an 11x speedup on a GPU while maintaining the benefits of CTC and RNNLM. With these benefits, we achieved almost real-time processing with a small latency of 0.1x real-time without a streaming search process.
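The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: replacing a per-hypothesis score accumulation loop with a single array operation over all hypotheses. It assumes NumPy, and the function names `looped_beam_step` and `vectorized_beam_step` are hypothetical, chosen for illustration; this is not the authors' implementation and it omits the CTC and RNNLM scores.

```python
import numpy as np


def looped_beam_step(acc_scores, step_log_probs, beam_size):
    """Reference version: accumulate scores one hypothesis at a time."""
    candidates = []
    for h, score in enumerate(acc_scores):
        for v, log_p in enumerate(step_log_probs[h]):
            candidates.append((score + log_p, h, v))
    # Keep the beam_size best (highest log-probability) candidates.
    candidates.sort(key=lambda c: -c[0])
    best = candidates[:beam_size]
    scores = np.array([c[0] for c in best])
    parents = np.array([c[1] for c in best])
    tokens = np.array([c[2] for c in best])
    return scores, parents, tokens


def vectorized_beam_step(acc_scores, step_log_probs, beam_size):
    """Sketch of a vectorized expansion step (an assumption, not the paper's code).

    acc_scores:     shape (beam,), accumulated log-probability per hypothesis
    step_log_probs: shape (beam, vocab), next-token log-probabilities
    Returns (scores, parent_idx, token_idx), each of shape (beam_size,).
    """
    vocab = step_log_probs.shape[1]
    # One broadcast addition replaces the nested accumulation loop.
    candidates = acc_scores[:, None] + step_log_probs  # (beam, vocab)
    flat = candidates.ravel()
    # Select the best candidates over all beam * vocab extensions at once.
    top = np.argsort(-flat, kind="stable")[:beam_size]
    # Recover which hypothesis and which token each flat index refers to.
    parent_idx, token_idx = np.divmod(top, vocab)
    return flat[top], parent_idx, token_idx
```

Both functions return the same hypotheses for the same inputs; the vectorized form simply exposes the work as array operations that a multi-core CPU or GPU backend can parallelize, and adding a leading batch axis would extend the same pattern to several utterances at once.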

Citation

Pages: 3825-3829

Number of pages: 5

