Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited by: 134
Authors
Karita, Shigeki [1 ]
Soplin, Nelson Enrique Yalta [2 ]
Watanabe, Shinji [3 ]
Delcroix, Marc [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Commun Sci Labs, Kyoto, Japan
[2] Waseda Univ, Tokyo, Japan
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
speech recognition; Transformer; connectionist temporal classification; language model;
DOI
10.21437/Interspeech.2019-1938
CLC Numbers
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
The state-of-the-art neural network architecture known as the Transformer has been used successfully for many sequence-to-sequence transformation tasks. Its advantage is fast iteration during training, because it involves no sequential operations like those in recurrent neural networks (RNNs). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) in terms of overall training speed (i.e., convergence) and word error rate (WER), thanks to effective joint training and decoding methods. To build a faster and more accurate ASR system, we combine the Transformer with these advances from RNN-based ASR. In our experiments, we found that Transformer training converges more slowly than RNN training in terms of the learning curve, and that integration with a naive language model (LM) is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with the Transformer for joint training and decoding. This approach makes training faster than with RNNs and facilitates LM integration. Our proposed ASR system achieves significant improvements on various ASR tasks; for example, introducing CTC and LM integration into the Transformer baseline reduced the WER from 11.1% to 4.5% on the Wall Street Journal corpus and from 16.1% to 11.6% on TED-LIUM.
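The joint training described in the abstract interpolates the CTC loss with the attention decoder's loss. The sketch below is a minimal pure-Python illustration, not the authors' implementation: it computes the CTC negative log-likelihood with the standard forward (alpha) recursion over a blank-extended label sequence, then mixes it with a placeholder attention loss using a hypothetical weight `lam`.

```python
import math

def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float('-inf'):
        return b
    if b == float('-inf'):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """CTC loss via the forward (alpha) recursion.

    log_probs: T x V nested list of per-frame log-probabilities.
    labels:    target label sequence (non-empty, no blanks).
    """
    # Interleave blanks: l1 l2 -> blank l1 blank l2 blank
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)
    NEG = float('-inf')

    # Initialization: a path may start with a blank or the first label.
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            acc = alpha[s]                              # stay on the same state
            if s >= 1:
                acc = logsumexp2(acc, alpha[s - 1])     # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = logsumexp2(acc, alpha[s - 2])     # skip over a blank
            new[s] = acc + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    return -logsumexp2(alpha[-1], alpha[-2])

def joint_loss(ctc_loss, att_loss, lam=0.3):
    """Multi-task objective: lam * L_CTC + (1 - lam) * L_attention."""
    return lam * ctc_loss + (1.0 - lam) * att_loss

# Toy check: 2 frames, vocab {0: blank, 1: 'a'}, uniform frame posteriors.
# Alignments that collapse to "a" are (-,a), (a,-), (a,a): total prob 0.75.
lp = [[math.log(0.5), math.log(0.5)]] * 2
ctc = ctc_neg_log_likelihood(lp, [1])
print(round(ctc, 4))                        # -log(0.75) = 0.2877
print(round(joint_loss(ctc, att_loss=1.0), 4))
```

In decoding, the same interpolation idea reappears as a weighted combination of attention decoder scores, CTC prefix scores, and LM scores during beam search; the weight 0.3 here is only an illustrative value.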
Pages: 1408-1412
Number of pages: 5
Related Papers
50 records total
  • [31] Online Compressive Transformer for End-to-End Speech Recognition
    Leong, Chi-Hang
    Huang, Yu-Han
    Chien, Jen-Tzung
    INTERSPEECH 2021, 2021, : 2082 - 2086
  • [32] Transformer-Based End-to-End Classification of Variable-Length Volumetric Data
    Oghbaie, Marzieh
    Araujo, Teresa
    Emre, Taha
    Schmidt-Erfurth, Ursula
    Bogunovic, Hrvoje
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT VI, 2023, 14225 : 358 - 367
  • [33] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [34] An End-to-End Chinese Speech Recognition Algorithm Integrating Language Model
    Lü, Kun-Ru
    Wu, Chun-Guo
    Liang, Yan-Chun
    Yuan, Yu-Ping
    Ren, Zhi-Min
    Zhou, You
    Shi, Xiao-Hu
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49 (11): : 2177 - 2185
  • [35] Variable Scale Pruning for Transformer Model Compression in End-to-End Speech Recognition
    Ben Letaifa, Leila
    Rouas, Jean-Luc
    ALGORITHMS, 2023, 16 (09)
  • [36] Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System
    Yamini, Shaarada D.
    Mirishkar, Ganesh S.
    Vuppala, Anil Kumar
    Purini, Suresh
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW, 2023, : 93 - 100
  • [37] IMPROVING END-TO-END SPEECH RECOGNITION WITH POLICY LEARNING
    Zhou, Yingbo
    Xiong, Caiming
    Socher, Richard
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5819 - 5823
  • [38] END-TO-END MULTI-CHANNEL TRANSFORMER FOR SPEECH RECOGNITION
    Chang, Feng-Ju
    Radfar, Martin
    Mouchtaris, Athanasios
    King, Brian
    Kunzmann, Siegfried
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5884 - 5888
  • [39] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [40] Study of Speech Recognition System Based on Transformer and Connectionist Temporal Classification Models for Low Resource Language
    Bansal, Shweta
    Sharan, Shambhu
    Agrawal, Shyam S.
    SPEECH AND COMPUTER, SPECOM 2022, 2022, 13721 : 56 - 63