GAUSSIAN KERNELIZED SELF-ATTENTION FOR LONG SEQUENCE DATA AND ITS APPLICATION TO CTC-BASED SPEECH RECOGNITION

Cited by: 4
Authors
Kashiwagi, Yosuke [1 ]
Tsunoo, Emiru [1 ]
Watanabe, Shinji [2 ]
Affiliations
[1] Sony Corp, Tokyo, Japan
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
speech recognition; end-to-end; self-attention; long sequence data
DOI
10.1109/ICASSP39728.2021.9413493
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that accuracy degrades when SA is applied to long sequence data. This is mainly due to the length mismatch between inference and training data, because training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture based on the Gaussian kernel, which is shift-invariant. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant, with relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA was applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation on the Corpus of Spontaneous Japanese (CSJ) and TEDLIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy (e.g., from 24.0% WER to 6.0% on CSJ) on long sequence data without any windowing techniques.
Pages: 6214 - 6218
Page count: 5
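The kernel view described in the abstract can be sketched minimally: with tied query/key projections, replacing the softmax dot-product attention with a row-normalized Gaussian (RBF) kernel over projected frames yields shift-invariant attention weights. The sketch below is an illustration only, not the paper's implementation; the names `w_qk`, `w_v`, and `sigma` are assumptions, and multi-head structure and the paper's frame-indexing technique are omitted.

```python
import numpy as np

def gaussian_kernel_self_attention(x, w_qk, w_v, sigma=1.0):
    """Single-head Gaussian kernelized self-attention (illustrative sketch).

    x    : (T, d_in) sequence of frames
    w_qk : (d_in, d_k) shared query/key projection (the equivalence in the
           abstract assumes tied query and key weights)
    w_v  : (d_in, d_v) value projection
    """
    q = x @ w_qk                 # (T, d_k), used as both queries and keys
    v = x @ w_v                  # (T, d_v)
    # Shift-invariant Gaussian kernel: depends only on differences q_i - q_j.
    diff = q[:, None, :] - q[None, :, :]          # (T, T, d_k)
    sq_dist = (diff ** 2).sum(axis=-1)            # (T, T)
    scores = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Row-normalize so each frame's attention weights sum to one,
    # mirroring the normalized-kernel interpretation of softmax attention.
    attn = scores / scores.sum(axis=1, keepdims=True)
    return attn @ v              # (T, d_v)
```

Because the kernel depends only on differences between projected frames, the attention pattern does not drift when sequences at inference time are much longer than the training segments, which is the mismatch the paper targets.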