Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0
Authors
Fan, Peng [1 ]
Shan, Changhao [2 ]
Sun, Sining [2 ]
Yang, Qing [2 ]
Zhang, Jianwei [3 ]
Affiliations
[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China
[2] Du Xiaoman Financial, Beijing 100089, Peoples R China
[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Keywords
Automatic speech recognition; self-attention; key frame; signal processing; drop frame
DOI
10.1109/LSP.2023.3327585
CLC (Chinese Library Classification)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject classification codes
0808; 0809
Abstract
Recently, the Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information, resulting in improved performance. However, Conformer-based models face a problem with self-attention: its computational complexity grows quadratically with the length of the input sequence. Inspired by previous Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder, and we define frames with non-blank outputs as key frames. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method that reduces the computation of self-attention using key frames. Our proposed architecture comprises two encoders. After the first encoder, an intermediate CTC loss function is used to obtain frame-level labels, enabling us to separate key frames from blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on the high-dimensional acoustic features and drops the frames corresponding to blank labels, yielding a shorter acoustic feature sequence as input to the second encoder. The proposed method achieves performance comparable to or better than the vanilla Conformer and similar work such as the Efficient Conformer. Meanwhile, it can discard more than 60% of the frames during model training and inference, which significantly accelerates inference.
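The key-frame selection described in the abstract (a frame is a key frame when the intermediate CTC output for it is non-blank; blank frames are dropped before the second encoder) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `select_key_frames`, the `blank_id` convention, and the use of a per-frame argmax over CTC posteriors are assumptions.

```python
def select_key_frames(ctc_posteriors, blank_id=0):
    """Return the indices of key frames.

    A frame is a key frame when its most probable symbol under the
    intermediate CTC head is not the blank; the remaining (blank)
    frames are the ones dropped before the second encoder (KFDS).

    ctc_posteriors: list of per-frame posterior lists, shape (T, V).
    """
    key_frames = []
    for t, frame in enumerate(ctc_posteriors):
        best = max(range(len(frame)), key=lambda v: frame[v])
        if best != blank_id:
            key_frames.append(t)
    return key_frames


# Toy example: 6 frames, 3-symbol vocabulary (index 0 = blank).
posteriors = [
    [0.98, 0.01, 0.01],    # blank  -> dropped
    [0.10, 0.80, 0.10],    # key frame
    [0.97, 0.02, 0.01],    # blank  -> dropped
    [0.20, 0.10, 0.70],    # key frame
    [0.99, 0.005, 0.005],  # blank  -> dropped
    [0.05, 0.90, 0.05],    # key frame
]
print(select_key_frames(posteriors))  # [1, 3, 5]
```

In the paper's pipeline these retained indices would gather the corresponding high-dimensional acoustic feature vectors, so the second encoder's self-attention runs over a much shorter sequence.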
Pages: 1612-1616
Page count: 5