Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0

Authors:

Fan, Peng [1]

Shan, Changhao [2]

Sun, Sining [2]

Yang, Qing [2]

Zhang, Jianwei [3]

Affiliations:

[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China

[2] Du Xiaoman Financial, Beijing 100089, Peoples R China

[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China

Source:

IEEE Signal Processing Letters | 2023 / Vol. 30

Keywords:

Automatic speech recognition; self-attention; key frame; signal processing; drop frame;

DOI:

10.1109/LSP.2023.3327585

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

Recently, the Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block leverages a self-attention mechanism to capture global information, along with a convolutional neural network to capture local information, resulting in improved performance. However, the Conformer-based model encounters an issue with the self-attention mechanism, as its computational complexity grows quadratically with the length of the input sequence. Inspired by previous work on Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames. The structure of our proposed approach comprises two encoders. Following the initial encoder, we introduce an intermediate CTC loss function to compute frame-level labels, enabling us to extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on high-dimensional acoustic features and drops the frames corresponding to blank labels, yielding a new acoustic feature sequence as input to the second encoder. The proposed method achieves comparable or higher performance than the vanilla Conformer and other similar work such as Efficient Conformer. Meanwhile, it can discard more than 60% of the useless frames during model training and inference, which significantly accelerates inference.
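The frame-dropping idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`key_frame_mask`, `drop_blank_frames`, `attention_cost_ratio`), the greedy argmax labeling, and the choice of index 0 as the blank symbol are all assumptions made for illustration, and KFSA itself is not modeled. The sketch marks frames whose intermediate CTC label is non-blank as key frames, keeps only those feature vectors, and estimates the relative self-attention cost from the quadratic dependence on sequence length.

```python
import numpy as np


def key_frame_mask(ctc_log_probs: np.ndarray, blank_id: int = 0) -> np.ndarray:
    """Boolean mask of shape (T,): True where the greedy CTC label is non-blank.

    ctc_log_probs has shape (T, V). Greedy argmax labeling and blank_id=0
    are assumptions of this sketch, not details taken from the paper.
    """
    greedy_labels = ctc_log_probs.argmax(axis=-1)
    return greedy_labels != blank_id


def drop_blank_frames(features: np.ndarray, ctc_log_probs: np.ndarray,
                      blank_id: int = 0) -> np.ndarray:
    """Keep only the rows of features (T, D) that correspond to key frames."""
    return features[key_frame_mask(ctc_log_probs, blank_id)]


def attention_cost_ratio(kept_frames: int, total_frames: int) -> float:
    """Relative self-attention cost after dropping frames, assuming O(T^2) cost."""
    return (kept_frames / total_frames) ** 2


# Toy input: 6 frames, vocabulary of 3 symbols (index 0 is the assumed blank).
probs = np.array([
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # symbol 1 -> key frame
    [0.7, 0.2, 0.1],    # blank
    [0.6, 0.3, 0.1],    # blank
    [0.2, 0.1, 0.7],    # symbol 2 -> key frame
    [0.8, 0.1, 0.1],    # blank
])
log_probs = np.log(probs)
features = np.arange(6 * 4, dtype=np.float32).reshape(6, 4)

kept = drop_blank_frames(features, log_probs)
print(kept.shape)  # (2, 4)
print(attention_cost_ratio(kept.shape[0], features.shape[0]))
```

Under the quadratic-cost assumption, discarding 60% of the frames leaves roughly 0.4^2 = 16% of the original self-attention cost in the layers that follow the drop. If every frame is labeled blank, this sketch returns an empty sequence, so a real system would need a fallback for that case.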


Pages: 1612-1616

Number of pages: 5

