DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

Cited by: 0

Authors:

Zhang, Shiliang [1]

Lei, Ming [1]

Yan, Zhijie [1]

Dai, Lirong [2]

Affiliations:

[1] Alibaba Inc, Hangzhou, Zhejiang, People's Republic of China

[2] USTC, NELSLIP, Hefei, Anhui, People's Republic of China

Source:

2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2018

Keywords:

DFSMN; FSMN; LFR; LVCSR; BLSTM; NETWORKS;

DOI:

Not available

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

In this paper, we present an improved feedforward sequential memory network (FSMN) architecture, namely Deep-FSMN (DFSMN), which introduces skip connections between the memory blocks of adjacent layers. These skip connections let information flow across layers and thus alleviate the vanishing-gradient problem when building very deep structures. As a result, DFSMN benefits significantly from both the skip connections and the deep structure. We compare DFSMN with BLSTM, both with and without lower frame rate (LFR), on several large speech recognition tasks covering English and Mandarin. Experimental results show that DFSMN consistently outperforms BLSTM by a large margin, especially when trained with LFR using CD-Phones as modeling units. On the 2000-hour Fisher (FSH) task, the proposed DFSMN achieves a word error rate of 9.4% using only the cross-entropy criterion and decoding with a 3-gram language model, a 1.5% absolute improvement over the BLSTM. On a 20000-hour Mandarin recognition task, the LFR-trained DFSMN achieves more than 20% relative improvement over the LFR-trained BLSTM. Moreover, the lookahead filter order of the memory blocks in DFSMN can easily be designed to control latency for real-time applications.
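
The abstract names two mechanisms without giving equations: a skip connection between memory blocks in adjacent layers, and a lookahead filter order that bounds latency. The following is a minimal NumPy sketch of both, assuming the usual FSMN-style memory block (a learned element-wise tap filter over past and future frames) and an identity skip connection. The function names, argument names, tap shapes, and strides are illustrative assumptions and are not taken from this record.

```python
import numpy as np

def memory_block(p, a, c, skip=None, s1=1, s2=1):
    """Hypothetical DFSMN-style memory block (illustrative, not the authors' code).

    p    : (T, D) projection outputs of the current layer
    a    : (N1 + 1, D) lookback taps, a[i] weights frame t - i*s1
    c    : (N2, D) lookahead taps, c[j-1] weights frame t + j*s2
    skip : optional (T, D) memory-block output of the layer below
    """
    p = np.asarray(p, dtype=float)
    T = p.shape[0]
    out = p.copy()
    if skip is not None:
        out += skip                      # identity skip connection (assumed form)
    for i in range(a.shape[0]):          # past context, including the current frame
        shift = i * s1
        if shift < T:
            out[shift:] += a[i] * p[:T - shift]
    for j in range(1, c.shape[0] + 1):   # future context
        shift = j * s2
        if shift < T:
            out[:T - shift] += c[j - 1] * p[shift:]
    return out

def total_lookahead_frames(layers):
    """Sum of future frames needed across layers; layers is a list of (N2, s2)."""
    return sum(n2 * s2 for n2, s2 in layers)
```

Under these assumptions, setting the lookahead order N2 to zero in every layer makes the stack causal, and the total number of future frames the model must wait for is the sum of N2 * s2 over layers, which is the quantity one would tune to control latency.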

Citation

Pages: 5869-5873

Page count: 5
