An Overview of End-to-End Automatic Speech Recognition

Cited by: 131

Authors:

Wang, Dong [1,2]

Wang, Xiaodong [1,2]

Lv, Shaohe [1,2]

Affiliations:

[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Proc Lab, Changsha 410073, Hunan, Peoples R China

[2] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China

Source:

SYMMETRY-BASEL | 2019, Vol. 11, No. 8

Keywords:

automatic speech recognition; end-to-end; deep learning; neural network; CTC; RNN-transducer; attention; HMM;

DOI:

10.3390/sym11081018

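Chinese Library Classification:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];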

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Automatic speech recognition, especially large-vocabulary continuous speech recognition, is an important problem in machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) was the mainstream speech recognition framework. Recently, however, the HMM-deep neural network (DNN) model and the end-to-end model, both built on deep learning, have surpassed HMM-GMM and achieve comparable performance to each other. The HMM-DNN model is nevertheless limited by factors inherited from the HMM, such as forced segmentation and alignment of data, independence assumptions, and separate training of multiple modules, whereas the end-to-end model offers a simplified structure, joint training, direct output, and no need for forced data alignment. The end-to-end model is therefore an important research direction in speech recognition. This paper reviews the development of end-to-end models. It first introduces the basic ideas, advantages, and disadvantages of HMM-based and end-to-end models and argues that the end-to-end model is the direction in which speech recognition is developing. It then focuses on the principles, progress, and research hotspots of three end-to-end models: connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based, and compares them in detail both theoretically and experimentally. Finally, it points out their respective advantages and disadvantages and the possible future development of the end-to-end model. Automatic speech recognition is a pattern recognition task in computer science, which is a subject area of Symmetry.

Pages: 26
