Online Compressive Transformer for End-to-End Speech Recognition

Cited by: 10

Authors:

Leong, Chi-Hang [1]

Huang, Yu-Han [1]

Chien, Jen-Tzung [1]

Affiliations:

[1] Natl Yang Ming Chiao Tung Univ, Dept Elect & Comp Engn, Taipei, Taiwan

Source:

INTERSPEECH 2021 | 2021

Keywords:

Online processing and learning; compressive transformer; end-to-end speech recognition; self-attention

DOI:

10.21437/Interspeech.2021-545

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Traditionally, the transformer with connectionist temporal classification (CTC) was developed for offline speech recognition, where the transcription is generated after the whole utterance has been spoken. However, many applications, including live broadcasting and meetings, require online transcription of the speech signal. This paper presents an online transformer for real-time speech recognition in which the transcription is generated chunk by chunk. In particular, an online compressive transformer (OCT) is proposed for end-to-end speech recognition. OCT aims to generate an immediate transcription for each audio chunk while still achieving performance comparable to offline speech recognition. In the implementation, OCT is tightly combined with both CTC and the recurrent neural network transducer by minimizing their losses during training. In addition, OCT is systematically merged with a compressive memory to reduce the potential performance degradation due to online processing. This degradation arises because each chunk is transcribed without history information. The experiments on speech recognition show that OCT not only obtains performance comparable to the offline transformer but also works faster than the baseline model.
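
The chunk-by-chunk processing with a compressive memory described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the compression function, the memory sizes, or the model itself. The names `mean_pool` and `stream_chunks`, the use of mean pooling as the compression step, and all parameters are assumptions made for illustration only. The sketch shows only how each chunk could be paired with a bounded history made of recent frames plus a compressed summary of older ones.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence, Tuple

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def mean_pool(frames: Sequence[Frame], rate: int) -> List[Frame]:
    """Compress frames by averaging each group of `rate` consecutive frames.

    Mean pooling is an assumed stand-in for the compression function.
    """
    pooled: List[Frame] = []
    for start in range(0, len(frames), rate):
        group = frames[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def stream_chunks(
    frames: Sequence[Frame],
    chunk_size: int,
    mem_chunks: int,
    comp_slots: int,
    rate: int,
) -> Iterator[Tuple[List[Frame], List[Frame]]]:
    """Yield (context, chunk) pairs for online, chunk-wise processing.

    `context` is the history available to the current chunk: compressed
    frames from old chunks followed by the uncompressed recent chunks.
    """
    memory: Deque[List[Frame]] = deque()              # recent chunks, uncompressed
    compressed: Deque[Frame] = deque(maxlen=comp_slots)  # oldest slots drop off
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        context = list(compressed) + [f for m in memory for f in m]
        yield context, chunk
        # After the chunk is consumed, it becomes part of the history.
        memory.append(chunk)
        if len(memory) > mem_chunks:
            evicted = memory.popleft()
            compressed.extend(mean_pool(evicted, rate))
```

In an actual recognizer, each `(context, chunk)` pair would be passed to the attention layers so that the current chunk attends to its history; here the pairs are simply yielded so the memory bookkeeping can be inspected in isolation.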


Pages: 2082-2086

Page count: 5
