Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0
Authors
Fan, Peng [1]
Shan, Changhao [2]
Sun, Sining [2]
Yang, Qing [2]
Zhang, Jianwei [3]
Affiliations
[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China
[2] Du Xiaoman Financial, Beijing 100089, Peoples R China
[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Keywords
Automatic speech recognition; self-attention; key frame; signal processing; drop frame
DOI
10.1109/LSP.2023.3327585
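Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809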
Abstract
Recently, the Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information. However, the computational complexity of self-attention grows quadratically with the length of the input sequence. Inspired by previous work on Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance in the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method that uses key frames to reduce the computation of self-attention. The proposed model comprises two encoders. After the first encoder, we introduce an intermediate CTC loss function to compute frame-level labels, which lets us extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on high-dimensional acoustic features and drops the frames corresponding to blank labels, yielding a new acoustic feature sequence as input to the second encoder. The proposed method achieves comparable or higher performance than the vanilla Conformer and similar work such as Efficient Conformer. Meanwhile, it can discard more than 60% of useless frames during model training and inference, which significantly accelerates inference.
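The frame-dropping idea behind KFDS can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of index 0 as the blank symbol, the greedy (highest-score) labelling of each frame, and the toy input are all assumptions made for illustration, and KFSA itself is not modelled here.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def key_frame_mask(ctc_scores: Sequence[Sequence[float]],
                   blank_id: int = BLANK_ID) -> List[bool]:
    """Return True for frames whose highest-scoring CTC label is not blank."""
    mask = []
    for frame_scores in ctc_scores:
        best_label = max(range(len(frame_scores)), key=frame_scores.__getitem__)
        mask.append(best_label != blank_id)
    return mask


def drop_blank_frames(features: Sequence[Sequence[float]],
                      mask: Sequence[bool]) -> List[Sequence[float]]:
    """Keep only the feature vectors at key-frame positions."""
    return [frame for frame, keep in zip(features, mask) if keep]


def attention_cost_ratio(num_frames: int, num_kept: int) -> float:
    """Ratio of pairwise attention scores after dropping frames: kept^2 / T^2."""
    return (num_kept ** 2) / (num_frames ** 2)


# Toy input (hypothetical): 10 frames, 4 output symbols, 8-dimensional features.
labels = [0, 0, 2, 0, 0, 0, 1, 0, 3, 0]
# One-hot scores stand in for the intermediate CTC outputs of the first encoder.
ctc_scores = [[1.0 if v == label else 0.0 for v in range(4)] for label in labels]
features = [[float(t)] * 8 for t in range(10)]

mask = key_frame_mask(ctc_scores)
key_frames = drop_blank_frames(features, mask)

print(len(key_frames), len(key_frames[0]))                    # 3 8
print(attention_cost_ratio(len(features), len(key_frames)))  # 0.09
```

Because full self-attention scores every pair of frames, keeping a fraction p of the frames leaves roughly p^2 of the pairwise scores. Under that simplification, dropping 60% of the frames would leave 0.4^2 = 0.16 of the attention scores in the second encoder.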
Pages: 1612-1616
Number of pages: 5