ESAformer: Enhanced Self-Attention for Automatic Speech Recognition

Cited by: 3
Authors
Li, Junhua [1 ]
Duan, Zhikui [1 ]
Li, Shiren [2 ]
Yu, Xinmei [1 ]
Yang, Guangguang [1 ]
Affiliations
[1] Foshan Univ, Foshan 528000, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou 510275, Peoples R China
Keywords
Feature extraction; Transformers; Convolution; Logic gates; Testing; Tensors; Training; Speech recognition; transformer; enhanced self-attention; multi-order interaction; TRANSFORMER
DOI
10.1109/LSP.2024.3358754
CLC classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject classification codes
0808; 0809
Abstract
In this letter, an Enhanced Self-Attention (ESA) module is proposed for feature extraction. The ESA module integrates a recursive gated convolution with the self-attention mechanism: the former captures multi-order feature interactions, while the latter extracts global features. In addition, the most suitable location for inserting the ESA module is explored. Here, the ESA module is embedded into the encoder layers of the Transformer network for automatic speech recognition (ASR), and the resulting model is named ESAformer. The effectiveness of the ESAformer is validated on three datasets: Aishell-1, HKUST, and WSJ. Experimental results show that, compared with the baseline Transformer, the ESAformer improves performance by 0.8% CER on Aishell-1, 1.2% CER on HKUST, and 0.7%/0.4% WER on WSJ.
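This record does not include the paper's implementation, so the following is only a minimal PyTorch sketch of what an ESA-style encoder block could look like: a 1-D adaptation of a recursive gated convolution (in the spirit of gnConv from HorNet) for the multi-order interaction branch, combined with standard multi-head self-attention for the global branch. The class names (RecursiveGatedConv1d, ESABlock), the residual-sum fusion of the two branches, and all hyperparameters are illustrative assumptions, not the authors' design.

# Hypothetical sketch of an ESA-style block; not the ESAformer implementation.
import torch
import torch.nn as nn

class RecursiveGatedConv1d(nn.Module):
    """1-D recursive gated convolution capturing multi-order feature interactions."""

    def __init__(self, dim: int, order: int = 3, kernel_size: int = 7):
        super().__init__()
        self.order = order
        # Channel widths double at each interaction order, e.g. [dim/4, dim/2, dim] for order=3.
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv1d(dim, 2 * dim, 1)
        # Depthwise conv mixes information along the time axis within each channel.
        self.dwconv = nn.Conv1d(sum(self.dims), sum(self.dims), kernel_size,
                                padding=kernel_size // 2, groups=sum(self.dims))
        self.pws = nn.ModuleList(
            [nn.Conv1d(self.dims[i], self.dims[i + 1], 1) for i in range(order - 1)]
        )
        self.proj_out = nn.Conv1d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time)
        fused = self.proj_in(x)
        p, q = torch.split(fused, (self.dims[0], sum(self.dims)), dim=1)
        q = torch.split(self.dwconv(q), self.dims, dim=1)
        y = p * q[0]                       # first-order gated interaction
        for i in range(self.order - 1):
            y = self.pws[i](y) * q[i + 1]  # each gating raises the interaction order
        return self.proj_out(y)

class ESABlock(nn.Module):
    """Hypothetical ESA: local multi-order branch plus global self-attention branch."""

    def __init__(self, dim: int, num_heads: int = 4, order: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rgconv = RecursiveGatedConv1d(dim, order)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = self.norm(x)
        global_feat, _ = self.attn(h, h, h)                          # global dependencies
        local_feat = self.rgconv(h.transpose(1, 2)).transpose(1, 2)  # multi-order local features
        return x + global_feat + local_feat                          # residual fusion (assumed)

if __name__ == "__main__":
    x = torch.randn(2, 100, 256)   # (batch, frames, feature dim)
    print(ESABlock(256)(x).shape)  # torch.Size([2, 100, 256])

The feature dimension must be divisible by 2**(order - 1) for the channel split to work out; how the two branches are actually fused in the paper (sum, gate, or sequential) is not stated in this record.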
Pages: 471-475
Number of pages: 5
Related papers
50 records in total
  • [11] SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION
    Salazar, Julian
    Kirchhoff, Katrin
    Huang, Zhiheng
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7115 - 7119
  • [12] MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION
    Sun, Licai
    Liu, Bin
    Tao, Jianhua
    Lian, Zheng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4275 - 4279
  • [13] A Fast Convolutional Self-attention Based Speech Dereverberation Method for Robust Speech Recognition
    Li, Nan
    Ge, Meng
    Wang, Longbiao
    Dang, Jianwu
    NEURAL INFORMATION PROCESSING (ICONIP 2019), PT III, 2019, 11955 : 295 - 305
  • [14] A Static Sign Language Recognition Method Enhanced with Self-Attention Mechanisms
    Wang, Yongxin
    Jiang, He
    Sun, Yutong
    Xu, Longqi
    SENSORS, 2024, 24 (21)
  • [15] Combining Part-of-Speech Tags and Self-Attention Mechanism for Simile Recognition
    Zhang, Pengfei
    Cai, Yi
    Chen, Junying
    Chen, Wenhao
    Song, Hengjie
    IEEE ACCESS, 2019, 7 : 163864 - 163876
  • [16] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [17] Speech emotion recognition using recurrent neural networks with directional self-attention
    Li, Dongdong
    Liu, Jinlin
    Yang, Zhuo
    Sun, Linyu
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 173
  • [18] Combining Gated Convolutional Networks and Self-Attention Mechanism for Speech Emotion Recognition
    Li, Chao
    Jiao, Jinlong
    Zhao, Yiqin
    Zhao, Ziping
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2019, : 105 - 109
  • [19] SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
    Parcollet, Titouan
    van Dalen, Rogier
    Zhang, Shucong
    Bhattacharya, Sourav
    INTERSPEECH 2024, 2024, : 3460 - 3464
  • [20] SPEECH DENOISING IN THE WAVEFORM DOMAIN WITH SELF-ATTENTION
    Kong, Zhifeng
    Ping, Wei
    Dantrey, Ambrish
    Catanzaro, Bryan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7867 - 7871