Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0
Authors
Fan, Peng [1 ]
Shan, Changhao [2 ]
Sun, Sining [2 ]
Yang, Qing [2 ]
Zhang, Jianwei [3 ]
Affiliations
[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China
[2] Du Xiaoman Financial, Beijing 100089, Peoples R China
[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Keywords
Automatic speech recognition; self-attention; key frame; signal processing; drop frame
DOI
10.1109/LSP.2023.3327585
CLC (Chinese Library Classification)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject classification codes
0808; 0809
Abstract
Recently, the Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information, resulting in improved performance. However, Conformer-based models face a problem with self-attention: its computational complexity grows quadratically with the length of the input sequence. Inspired by previous Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder, and we define frames with non-blank outputs as key frames. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method that reduces the computation of self-attention using key frames. Our proposed architecture comprises two encoders. After the first encoder, an intermediate CTC loss function is used to obtain frame-level labels, enabling us to separate key frames from blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on the high-dimensional acoustic features and drops the frames corresponding to blank labels, yielding a shorter acoustic feature sequence as input to the second encoder. The proposed method achieves performance comparable to or better than the vanilla Conformer and similar work such as the Efficient Conformer. Meanwhile, it can discard more than 60% of the frames during model training and inference, which significantly accelerates inference.
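The key-frame selection described in the abstract (a frame is a key frame when the intermediate CTC output for it is non-blank; blank frames are dropped before the second encoder) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `select_key_frames`, the `blank_id` convention, and the use of a per-frame argmax over CTC posteriors are assumptions.

```python
def select_key_frames(ctc_posteriors, blank_id=0):
    """Return the indices of key frames.

    A frame is a key frame when its most probable symbol under the
    intermediate CTC head is not the blank; the remaining (blank)
    frames are the ones dropped before the second encoder (KFDS).

    ctc_posteriors: list of per-frame posterior lists, shape (T, V).
    """
    key_frames = []
    for t, frame in enumerate(ctc_posteriors):
        best = max(range(len(frame)), key=lambda v: frame[v])
        if best != blank_id:
            key_frames.append(t)
    return key_frames


# Toy example: 6 frames, 3-symbol vocabulary (index 0 = blank).
posteriors = [
    [0.98, 0.01, 0.01],    # blank  -> dropped
    [0.10, 0.80, 0.10],    # key frame
    [0.97, 0.02, 0.01],    # blank  -> dropped
    [0.20, 0.10, 0.70],    # key frame
    [0.99, 0.005, 0.005],  # blank  -> dropped
    [0.05, 0.90, 0.05],    # key frame
]
print(select_key_frames(posteriors))  # [1, 3, 5]
```

In the paper's pipeline these retained indices would gather the corresponding high-dimensional acoustic feature vectors, so the second encoder's self-attention runs over a much shorter sequence.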
Pages: 1612-1616
Page count: 5