GAUSSIAN KERNELIZED SELF-ATTENTION FOR LONG SEQUENCE DATA AND ITS APPLICATION TO CTC-BASED SPEECH RECOGNITION

Cited by: 4
Authors
Kashiwagi, Yosuke [1 ]
Tsunoo, Emiru [1 ]
Watanabe, Shinji [2 ]
Affiliations
[1] Sony Corp, Tokyo, Japan
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
speech recognition; end-to-end; self-attention; long sequence data;
DOI
10.1109/ICASSP39728.2021.9413493
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that accuracy degrades when applying SA to long sequence data. This is mainly due to the length mismatch between the inference and training data, because the training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture based on the Gaussian kernel, which is a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant, with the relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA was applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation with the Corpus of Spontaneous Japanese (CSJ) and TEDLIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy (e.g., from 24.0% to 6.0% WER on CSJ) for long sequence data without any windowing techniques.
Pages: 6214-6218
Page count: 5
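
To make the kernel view described in the abstract concrete, the sketch below contrasts standard dot-product self-attention with a shared query/key projection (whose attention weights form a row-normalized exponential kernel over the projected frames) against a Gaussian-kernel variant whose weights depend only on feature differences and, via an appended frame index, on relative position. This is a minimal NumPy illustration inferred from the abstract only: the function names, the single-head unmasked formulation, the parameters sigma and index_scale, and the way the frame index is concatenated to the projected features are assumptions, not the authors' actual architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shared_qk_self_attention(x, w_qk, w_v):
    """Dot-product SA with one shared projection for queries and keys.

    With q_i = k_i = x_i @ w_qk, the weights are softmax_j(q_i . q_j / sqrt(d)),
    i.e. a row-normalized exponential (dot-product) kernel over projected frames.
    """
    q = x @ w_qk                      # (T, d): shared query/key projection
    v = x @ w_v                       # (T, d): values
    d = q.shape[-1]
    scores = q @ q.T / np.sqrt(d)     # (T, T) kernel matrix
    return softmax(scores, axis=-1) @ v

def gaussian_kernel_self_attention(x, w_qk, w_v, sigma=1.0, index_scale=0.1):
    """Gaussian-kernel variant (sketch): replace the dot-product kernel with a
    shift-invariant Gaussian kernel, and append a scaled frame index so the
    kernel also decays with relative position |i - j| (frame indexing).
    """
    T = x.shape[0]
    q = x @ w_qk
    idx = (np.arange(T, dtype=x.dtype) * index_scale)[:, None]
    f = np.concatenate([q, idx], axis=-1)                      # features + index
    sq_dist = ((f[:, None, :] - f[None, :, :]) ** 2).sum(-1)   # ||f_i - f_j||^2
    scores = -sq_dist / (2.0 * sigma ** 2)                     # log Gaussian kernel
    v = x @ w_v
    return softmax(scores, axis=-1) @ v                        # normalized kernel @ V

# Tiny usage example on random "frames".
rng = np.random.default_rng(0)
T, d_in, d = 8, 16, 4
x = rng.standard_normal((T, d_in))
w_qk = rng.standard_normal((d_in, d)) * 0.1
w_v = rng.standard_normal((d_in, d)) * 0.1
print(shared_qk_self_attention(x, w_qk, w_v).shape)        # (8, 4)
print(gaussian_kernel_self_attention(x, w_qk, w_v).shape)  # (8, 4)
```

Under these assumptions, sigma and index_scale jointly control how quickly the attention weights decay with feature distance and with relative frame distance; because the scores depend only on differences, the behavior stays consistent between short training segments and long inference utterances.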