Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited by: 134

Authors:

Karita, Shigeki [1]

Soplin, Nelson Enrique Yalta [2]

Watanabe, Shinji [3]

Delcroix, Marc [1]

Ogawa, Atsunori [1]

Nakatani, Tomohiro [1]

Affiliations:

[1] NTT Commun Sci Labs, Kyoto, Japan

[2] Waseda Univ, Tokyo, Japan

[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

speech recognition; Transformer; connectionist temporal classification; language model

DOI:

10.21437/Interspeech.2019-1938

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of Transformer is slower than that of RNN as regards the learning curve and integration with the naive language model (LM) is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system realizes significant improvements in various ASR tasks. For example, it reduced the WERs from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on the TED-LIUM by introducing CTC and LM integration into the Transformer baseline.
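The abstract describes joint CTC/attention training and decoding with LM integration but gives no formulas. The sketch below assumes the linear interpolation commonly used in hybrid CTC/attention systems; the function names, the hypothesis fields, and the weight values (`ctc_weight`, `lm_weight`) are illustrative assumptions, not values taken from this record.

```python
def joint_training_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the CTC loss and the attention-decoder loss.

    ctc_weight is an assumed hyperparameter in [0, 1].
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def joint_decoding_score(logp_attention, logp_ctc, logp_lm,
                         ctc_weight=0.3, lm_weight=1.0):
    """Score one hypothesis from three log-probabilities.

    The attention and CTC scores are interpolated, then the LM
    log-probability is added with its own (assumed) weight.
    """
    acoustic = (1.0 - ctc_weight) * logp_attention + ctc_weight * logp_ctc
    return acoustic + lm_weight * logp_lm


def pick_best(hypotheses, ctc_weight=0.3, lm_weight=1.0):
    """Return the hypothesis with the highest joint score.

    Each hypothesis is a dict with hypothetical keys
    'text', 'att', 'ctc', and 'lm' (log-probabilities).
    """
    return max(
        hypotheses,
        key=lambda h: joint_decoding_score(
            h["att"], h["ctc"], h["lm"], ctc_weight, lm_weight
        ),
    )
```

In this toy form, a hypothesis that the attention decoder prefers can still lose to one that CTC and the LM score better, which is the intuition behind using CTC to assist decoding and LM integration.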

Pages: 1408-1412

Number of pages: 5
