Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited by: 134
Authors:
Karita, Shigeki [1 ]
Soplin, Nelson Enrique Yalta [2 ]
Watanabe, Shinji [3 ]
Delcroix, Marc [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations:
[1] NTT Commun Sci Labs, Kyoto, Japan
[2] Waseda Univ, Tokyo, Japan
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords:
speech recognition; Transformer; connectionist temporal classification; language model;
DOI:
10.21437/Interspeech.2019-1938
Chinese Library Classification: R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes: 100104; 100213
Abstract:
The Transformer, a state-of-the-art neural network architecture, has been applied successfully to many sequence-to-sequence tasks. Its advantage is fast iteration during training because, unlike recurrent neural networks (RNNs), it involves no sequential operations. However, RNNs remain the best option for end-to-end automatic speech recognition (ASR) in terms of overall training speed (i.e., convergence) and word error rate (WER), thanks to effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine the Transformer with these advances from RNN-based ASR. In our experiments, we found that the Transformer's learning curve converges more slowly than an RNN's and that naive language model (LM) integration is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with the Transformer for joint training and decoding. This approach makes training faster than with RNNs and facilitates LM integration. Our proposed ASR system achieves significant improvements on various ASR tasks. For example, introducing CTC and LM integration into the Transformer baseline reduced the WER from 11.1% to 4.5% on the Wall Street Journal corpus and from 16.1% to 11.6% on TED-LIUM.
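The joint training objective and LM-integrated decoding rule described in the abstract can be sketched as follows. This is a minimal illustration under common assumptions, not the authors' implementation: the function names are hypothetical, and the default weights (a CTC weight around 0.3, an LM weight around 0.7) are typical choices for hybrid CTC/attention systems, not necessarily the values used in the paper.

```python
def multitask_loss(att_loss: float, ctc_loss: float,
                   ctc_weight: float = 0.3) -> float:
    """Joint CTC/attention training objective:
    L = lambda * L_ctc + (1 - lambda) * L_att,
    where lambda (ctc_weight) interpolates the two losses."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(log_p_att: float, log_p_ctc: float, log_p_lm: float,
                ctc_weight: float = 0.3, lm_weight: float = 0.7) -> float:
    """Score one partial hypothesis during beam search by combining the
    attention-decoder score, the CTC prefix score, and an external
    language model score (shallow fusion), all in log-probability space."""
    return ((1.0 - ctc_weight) * log_p_att
            + ctc_weight * log_p_ctc
            + lm_weight * log_p_lm)
```

During beam search, each partial hypothesis carries these three log-probabilities, and the combined score ranks the beam candidates; the CTC term penalizes hypotheses whose alignments the CTC branch finds implausible, which is what assists LM integration.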
Pages: 1408-1412 (5 pages)
Related Papers (50 total):
  • [21] End-to-end Keywords Spotting Based on Connectionist Temporal Classification for Mandarin
    Bai, Ye
    Yi, Jiangyan
    Ni, Hao
    Wen, Zhengqi
    Liu, Bin
    Li, Ya
    Tao, Jianhua
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [22] Spatial-temporal transformer for end-to-end sign language recognition
    Cui, Zhenchao
    Zhang, Wenbo
    Li, Zhaoxin
    Wang, Zhaoqi
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (04) : 4645 - 4656
  • [23] Semantic Mask for Transformer based End-to-End Speech Recognition
    Wang, Chengyi
    Wu, Yu
    Du, Yujiao
    Li, Jinyu
    Liu, Shujie
    Lu, Liang
    Ren, Shuo
    Ye, Guoli
    Zhao, Sheng
    Zhou, Ming
    INTERSPEECH 2020, 2020, : 971 - 975
  • [24] Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
    Lohrenz, Timo
    Li, Zhengyang
    Fingscheidt, Tim
    INTERSPEECH 2021, 2021, : 2846 - 2850
  • [25] Fast offline transformer-based end-to-end automatic speech recognition for real-world applications
    Oh, Yoo Rhee
    Park, Kiyoung
    Park, Jeon Gue
    ETRI JOURNAL, 2022, 44 (03) : 476 - 490
  • [26] End to end transformer-based contextual speech recognition based on pointer network
    Lin, Binghuai
    Wang, Liyuan
    INTERSPEECH 2021, 2021, : 2087 - 2091
  • [27] Transformer Model Compression for End-to-End Speech Recognition on Mobile Devices
    Ben Letaifa, Leila
    Rouas, Jean-Luc
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 439 - 443
  • [28] HyperSFormer: A Transformer-Based End-to-End Hyperspectral Image Classification Method for Crop Classification
    Xie, Jiaxing
    Hua, Jiajun
    Chen, Shaonan
    Wu, Peiwen
    Gao, Peng
    Sun, Daozong
    Lyu, Zhendong
    Lyu, Shilei
    Xue, Xiuyun
    Lu, Jianqiang
    REMOTE SENSING, 2023, 15 (14)
  • [29] Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model
    Tian, Jinchuan
    Yu, Jianwei
    Weng, Chao
    Zou, Yuexian
    Yu, Dong
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 812 - 816
  • [30] End-to-End Speech Recognition of Tamil Language
    Changrampadi, Mohamed Hashim
    Shahina, A.
    Narayanan, M. Badri
    Khan, A. Nayeemulla
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 32 (02): : 1309 - 1323