Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0

Authors:

Fan, Peng [1]

Shan, Changhao [2]

Sun, Sining [2]

Yang, Qing [2]

Zhang, Jianwei [3]

Affiliations:

[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China

[2] Du Xiaoman Financial, Beijing 100089, Peoples R China

[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China

Source:

IEEE SIGNAL PROCESSING LETTERS | 2023 / Vol. 30

Keywords:

Automatic speech recognition; self-attention; key frame; signal processing; drop frame;

DOI:

10.1109/LSP.2023.3327585

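Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification code: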

0808 ; 0809 ;

Abstract:

Recently, the Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information. However, the computational complexity of self-attention grows quadratically with the length of the input sequence. Inspired by earlier work on Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance for the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method that uses key frames to reduce the computation of self-attention. The proposed structure comprises two encoders. After the first encoder, an intermediate CTC loss function is used to compute frame-level labels, which lets us separate key frames from blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on high-dimensional acoustic features and drops the frames corresponding to blank labels, producing a new acoustic feature sequence as input to the second encoder. The proposed method achieves performance comparable to or better than the vanilla Conformer and similar work such as Efficient Conformer. Meanwhile, it can discard more than 60% of the frames as useless during model training and inference, which significantly accelerates inference.
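
The frame-dropping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how labels are derived from the intermediate CTC output, so this sketch assumes a greedy (argmax) label per frame, and the names `BLANK_ID`, `select_key_frames`, and `key_frame_downsample` are hypothetical. It covers only the dropping of blank frames (the KFDS idea), not KFSA.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: the CTC blank label has index 0


def select_key_frames(ctc_posteriors: Sequence[Sequence[float]],
                      blank_id: int = BLANK_ID) -> List[int]:
    """Return indices of frames whose greedy CTC label is non-blank."""
    key_frames = []
    for t, frame in enumerate(ctc_posteriors):
        # Greedy label: index of the largest posterior in this frame.
        best = max(range(len(frame)), key=lambda k: frame[k])
        if best != blank_id:
            key_frames.append(t)
    return key_frames


def key_frame_downsample(features: Sequence[Sequence[float]],
                         ctc_posteriors: Sequence[Sequence[float]],
                         blank_id: int = BLANK_ID) -> List[Sequence[float]]:
    """Drop frames labeled blank; keep the remaining frames in order."""
    return [features[t] for t in select_key_frames(ctc_posteriors, blank_id)]


# Toy input: 6 frames, 3 labels (index 0 is blank), 2-dimensional features.
toy_posteriors = [
    [0.80, 0.10, 0.10],  # blank
    [0.10, 0.80, 0.10],  # label 1 -> key frame
    [0.70, 0.20, 0.10],  # blank
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # label 2 -> key frame
    [0.60, 0.30, 0.10],  # blank
]
toy_features = [[float(t), float(t) + 0.5] for t in range(6)]

key_idx = select_key_frames(toy_posteriors)
downsampled = key_frame_downsample(toy_features, toy_posteriors)
drop_ratio = 1.0 - len(key_idx) / len(toy_features)
```

In this toy input, four of the six frames are blank, so only two frames would be passed on to a second encoder. As a rough illustration, if self-attention cost scales with the square of the sequence length, keeping 40% of the frames would leave about 0.4^2 = 16% of that cost in the layers that see the shortened sequence.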

Pages: 1612-1616

Number of pages: 5
