An Overview of End-to-End Automatic Speech Recognition

Cited by: 131

Authors:

Wang, Dong [1,2]

Wang, Xiaodong [1,2]

Lv, Shaohe [1,2]

Affiliations:

[1] Natl Univ Def Technol, Sci & Technol Parallel & Distributed Proc Lab, Changsha 410073, Hunan, Peoples R China

[2] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China

Source:

SYMMETRY-BASEL | 2019 / Vol. 11 / Issue 08

Keywords:

automatic speech recognition; end-to-end; deep learning; neural network; CTC; RNN-transducer; attention; HMM;

DOI:

10.3390/sym11081018

Chinese Library Classification (CLC) number:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

Automatic speech recognition, especially large-vocabulary continuous speech recognition, is an important problem in the field of machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) was the mainstream speech recognition framework. Recently, however, the HMM-deep neural network (DNN) model and the end-to-end model, both built on deep learning, have achieved performance beyond HMM-GMM, and the two perform comparably. The HMM-DNN model is nevertheless limited by several factors inherited from HMM, such as forced segmentation and alignment of the data, independence assumptions, and separate training of multiple modules, whereas the end-to-end model offers a simplified structure, joint training, direct output, and no need for forced data alignment. The end-to-end model is therefore an important research direction in speech recognition. In this paper we review the development of end-to-end models. We first introduce the basic ideas, advantages, and disadvantages of HMM-based and end-to-end models, and argue that the end-to-end model is the direction in which speech recognition is developing. We then focus on the principles, progress, and research hotspots of three end-to-end models, namely connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based models, and compare them in detail both theoretically and experimentally. Finally, we point out their respective advantages and disadvantages and possible future developments of the end-to-end model. Automatic speech recognition is a pattern recognition task in the field of computer science, which is a subject area of Symmetry.

Pages: 26
