ESAformer: Enhanced Self-Attention for Automatic Speech Recognition

Cited by: 3
Authors
Li, Junhua [1 ]
Duan, Zhikui [1 ]
Li, Shiren [2 ]
Yu, Xinmei [1 ]
Yang, Guangguang [1 ]
Affiliations
[1] Foshan Univ, Foshan 528000, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou 510275, Peoples R China
Keywords
Feature extraction; Transformers; Convolution; Logic gates; Testing; Tensors; Training; Speech recognition; transformer; enhanced self-attention; multi-order interaction
DOI
10.1109/LSP.2024.3358754
CLC classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject classification codes
0808; 0809
Abstract
In this letter, an Enhanced Self-Attention (ESA) module is proposed for feature extraction. The ESA integrates recursive gated convolution with the self-attention mechanism: the former captures multi-order feature interactions, while the latter performs global feature extraction. In addition, the question of where in the network the ESA should be inserted is explored. Here, the ESA is embedded into the encoder layers of the Transformer network for automatic speech recognition (ASR), and the resulting model is named ESAformer. The effectiveness of ESAformer is validated on three datasets: Aishell-1, HKUST and WSJ. Experimental results show that, compared with the Transformer baseline, improvements of 0.8% CER on Aishell-1, 1.2% CER on HKUST, and 0.7%/0.4% WER on WSJ can be achieved.
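The record does not reproduce the module equations, so the following is a minimal PyTorch sketch of the idea only: a 1-D recursive gated convolution (in the spirit of gnConv) supplying multi-order local interactions, paired with standard multi-head self-attention for global context. `RecursiveGatedConv1d`, `ESABlock`, the residual wiring, and the hyperparameters (`order=3`, `kernel_size=7`) are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class RecursiveGatedConv1d(nn.Module):
    # 1-D recursive gated convolution in the spirit of gnConv (assumption:
    # the letter's "recursive gated convolution" follows this pattern).
    # Projected features are split into `order` chunks whose channel widths
    # double at each step; successive elementwise products raise the
    # interaction order. `dim` must be divisible by 2 ** (order - 1).
    def __init__(self, dim, order=3, kernel_size=7):
        super().__init__()
        self.dims = [dim // 2 ** i for i in range(order)][::-1]  # e.g. [d/4, d/2, d]
        self.proj_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.dwconv = nn.Conv1d(sum(self.dims), sum(self.dims), kernel_size,
                                padding=kernel_size // 2, groups=sum(self.dims))
        self.pws = nn.ModuleList(
            nn.Conv1d(self.dims[i], self.dims[i + 1], kernel_size=1)
            for i in range(order - 1))
        self.proj_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (batch, dim, time)
        gate, feats = self.proj_in(x).split([self.dims[0], sum(self.dims)], dim=1)
        feats = self.dwconv(feats).split(self.dims, dim=1)
        y = gate * feats[0]                    # first-order interaction
        for pw, f in zip(self.pws, feats[1:]):
            y = pw(y) * f                      # raise the interaction order by one
        return self.proj_out(y)


class ESABlock(nn.Module):
    # Hypothetical wiring of the ESA idea: multi-order local interactions
    # from the gated convolution, then global context from multi-head
    # self-attention, each behind a residual connection.
    def __init__(self, dim, num_heads=4, order=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.gconv = RecursiveGatedConv1d(dim, order)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, time, dim)
        h = self.gconv(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + h                              # local, multi-order branch
        q = self.norm2(x)
        h, _ = self.attn(q, q, q, need_weights=False)
        return x + h                           # global self-attention branch


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)           # (batch, frames, feature dim)
    print(ESABlock(256)(feats).shape)          # torch.Size([2, 100, 256])
```

In an ESAformer-style encoder, one such block would sit inside each Transformer encoder layer in place of the plain self-attention sublayer; note that the channel split above requires the model dimension to be divisible by 2^(order-1).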
Pages: 471-475
Page count: 5