Online Compressive Transformer for End-to-End Speech Recognition

被引：10

作者：

Leong, Chi-Hang ^{[1
]}

Huang, Yu-Han ^{[1
]}

Chien, Jen-Tzung ^{[1
]}

机构：

[1] Natl Yang Ming Chiao Tung Univ, Dept Elect & Comp Engn, Taipei, Taiwan

来源：

INTERSPEECH 2021 | 2021年

关键词：

Online processing and learning; compressive transformer; end-to-end speech recognition; SELF-ATTENTION;

D O I：

10.21437/Interspeech.2021-545

中图分类号：

R36 [病理学]; R76 [耳鼻咽喉科学];

学科分类号：

100104 ; 100213 ;

摘要：

Traditionally, transformer with connectionist temporal classification (CTC) was developed for offline speech recognition where the transcription was generated after the whole utterance has been spoken. However, it is crucial to carry out online transcription of speech signal for many applications including live broadcasting and meeting. This paper presents an online transformer for real-time speech recognition where online transcription is generated chunk by chuck. In particular, an online compressive transformer (OCT) is proposed for end-to-end speech recognition. This OCT aims to generate immediate transcription for each audio chunk while the comparable performance with offline speech recognition can be still achieved. In the implementation, OCT tightly combines with both CTC and recurrent neural network transducer by minimizing their losses for training. In addition, this OCT systematically merges with compressive memory to reduce potential performance degradation due to online processing. This degradation is caused by online transcription which is generated by the chunks without history information. The experiments on speech recognition show that OCT does not only obtain comparable performance with offline transformer, but also work faster than the baseline model.

引用

页码：2082 / 2086

页数：5

共 50 条

[31] Segment boundary detection directed attention for online end-to-end speech recognition
Hou, Junfeng
Guo, Wu
Song, Yan
Dai, Li-Rong
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2020, 2020 (01)
[32] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
Luo, Haoneng
Zhang, Shiliang
Lei, Ming
Xie, Lei
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
[33] TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION WITH LOCAL DENSE SYNTHESIZER ATTENTION
Xu, Menglong
Li, Shengqiang
Zhang, Xiao-Lei
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5899 - 5903
[34] LIGHTWEIGHT AND EFFICIENT END-TO-END SPEECH RECOGNITION USING LOW-RANK TRANSFORMER
Winata, Genta Indra
Cahyawijaya, Samuel
Lin, Zhaojiang
Liu, Zihan
Fung, Pascale
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6144 - 6148
[35] A study of transformer-based end-to-end speech recognition system for Kazakh language
Mamyrbayev Orken
Oralbekova Dina
Alimhan Keylan
Turdalykyzy Tolganay
Othman Mohamed
Scientific Reports, 12
[36] END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM
Kim, Chanwoo
Kim, Sungsoo
Kim, Kwangyoun
Kumar, Mehul
Kim, Jiyeon
Lee, Kyungmin
Han, Changwoo
Garg, Abhinav
Kim, Eunhyang
Shin, Minkyoo
Singh, Shatrughan
Heck, Larry
Gowda, Dhananjaya
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 562 - 569
[37] Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
Tian, Zhengkun
Yi, Jiangyan
Tao, Jianhua
Bai, Ye
Zhang, Shuai
Wen, Zhengqi
INTERSPEECH 2020, 2020, : 5026 - 5030
[38] A study of transformer-based end-to-end speech recognition system for Kazakh language
Mamyrbayev, Orken
Oralbekova, Dina
Alimhan, Keylan
Turdalykyzy, Tolganay
Othman, Mohamed
SCIENTIFIC REPORTS, 2022, 12 (01)
[39] Evolved Speech-Transformer: Applying Neural Architecture Search to End-to-End Automatic Speech Recognition
Kim, Jihwan
Wang, Jisung
Kim, Sangki
Lee, Yeha
INTERSPEECH 2020, 2020, : 1788 - 1792
[40] SYNCHRONOUS TRANSFORMERS FOR END-TO-END SPEECH RECOGNITION
Tian, Zhengkun
Yi, Jiangyan
Bai, Ye
Tao, Jianhua
Zhang, Shuai
Wen, Zhengqi
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7884 - 7888

← 1 2 3 4 5 →