Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited by: 134
Authors:
Karita, Shigeki [1 ]
Soplin, Nelson Enrique Yalta [2 ]
Watanabe, Shinji [3 ]
Delcroix, Marc [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations:
[1] NTT Commun Sci Labs, Kyoto, Japan
[2] Waseda Univ, Tokyo, Japan
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords:
speech recognition; Transformer; connectionist temporal classification; language model;
DOI:
10.21437/Interspeech.2019-1938
Chinese Library Classification: R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes: 100104; 100213
Abstract:
The Transformer, a state-of-the-art neural network architecture, has been applied successfully to many sequence-to-sequence tasks. Its advantage is fast iteration during training because, unlike recurrent neural networks (RNNs), it involves no sequential operations. However, RNNs remain the best option for end-to-end automatic speech recognition (ASR) in terms of overall training speed (i.e., convergence) and word error rate (WER), thanks to effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine the Transformer with these advances from RNN-based ASR. In our experiments, we found that the Transformer's learning curve converges more slowly than an RNN's and that naive language model (LM) integration is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with the Transformer for joint training and decoding. This approach makes training faster than with RNNs and facilitates LM integration. Our proposed ASR system achieves significant improvements on various ASR tasks. For example, introducing CTC and LM integration into the Transformer baseline reduced the WER from 11.1% to 4.5% on the Wall Street Journal corpus and from 16.1% to 11.6% on TED-LIUM.
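The joint training objective and LM-integrated decoding rule described in the abstract can be sketched as follows. This is a minimal illustration under common assumptions, not the authors' implementation: the function names are hypothetical, and the default weights (a CTC weight around 0.3, an LM weight around 0.7) are typical choices for hybrid CTC/attention systems, not necessarily the values used in the paper.

```python
def multitask_loss(att_loss: float, ctc_loss: float,
                   ctc_weight: float = 0.3) -> float:
    """Joint CTC/attention training objective:
    L = lambda * L_ctc + (1 - lambda) * L_att,
    where lambda (ctc_weight) interpolates the two losses."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(log_p_att: float, log_p_ctc: float, log_p_lm: float,
                ctc_weight: float = 0.3, lm_weight: float = 0.7) -> float:
    """Score one partial hypothesis during beam search by combining the
    attention-decoder score, the CTC prefix score, and an external
    language model score (shallow fusion), all in log-probability space."""
    return ((1.0 - ctc_weight) * log_p_att
            + ctc_weight * log_p_ctc
            + lm_weight * log_p_lm)
```

During beam search, each partial hypothesis carries these three log-probabilities, and the combined score ranks the beam candidates; the CTC term penalizes hypotheses whose alignments the CTC branch finds implausible, which is what assists LM integration.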
Pages: 1408-1412 (5 pages)
Related Papers (50 total):
  • [21] End-to-end Keywords Spotting Based on Connectionist Temporal Classification for Mandarin
    Bai, Ye
    Yi, Jiangyan
    Ni, Hao
    Wen, Zhengqi
    Liu, Bin
    Li, Ya
    Tao, Jianhua
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [22] Spatial-temporal transformer for end-to-end sign language recognition
    Cui, Zhenchao
    Zhang, Wenbo
    Li, Zhaoxin
    Wang, Zhaoqi
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (04) : 4645 - 4656
  • [23] Semantic Mask for Transformer based End-to-End Speech Recognition
    Wang, Chengyi
    Wu, Yu
    Du, Yujiao
    Li, Jinyu
    Liu, Shujie
    Lu, Liang
    Ren, Shuo
    Ye, Guoli
    Zhao, Sheng
    Zhou, Ming
    INTERSPEECH 2020, 2020, : 971 - 975
  • [24] Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
    Lohrenz, Timo
    Li, Zhengyang
    Fingscheidt, Tim
    INTERSPEECH 2021, 2021, : 2846 - 2850
  • [25] Fast offline transformer-based end-to-end automatic speech recognition for real-world applications
    Oh, Yoo Rhee
    Park, Kiyoung
    Park, Jeon Gue
    ETRI JOURNAL, 2022, 44 (03) : 476 - 490
  • [26] End to end transformer-based contextual speech recognition based on pointer network
    Lin, Binghuai
    Wang, Liyuan
    INTERSPEECH 2021, 2021, : 2087 - 2091
  • [27] Transformer Model Compression for End-to-End Speech Recognition on Mobile Devices
    Ben Letaifa, Leila
    Rouas, Jean-Luc
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 439 - 443
  • [28] HyperSFormer: A Transformer-Based End-to-End Hyperspectral Image Classification Method for Crop Classification
    Xie, Jiaxing
    Hua, Jiajun
    Chen, Shaonan
    Wu, Peiwen
    Gao, Peng
    Sun, Daozong
    Lyu, Zhendong
    Lyu, Shilei
    Xue, Xiuyun
    Lu, Jianqiang
    REMOTE SENSING, 2023, 15 (14)
  • [29] Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model
    Tian, Jinchuan
    Yu, Jianwei
    Weng, Chao
    Zou, Yuexian
    Yu, Dong
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 812 - 816
  • [30] End-to-End Speech Recognition of Tamil Language
    Changrampadi, Mohamed Hashim
    Shahina, A.
    Narayanan, M. Badri
    Khan, A. Nayeemulla
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 32 (02): : 1309 - 1323