Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models;
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104 ; 100213 ;
Abstract
The Transformer, a state-of-the-art neural network architecture, has been applied successfully to a variety of sequence-to-sequence transformation tasks. It disperses its attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. Automatic speech recognition (ASR), however, is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), the RNN Transducer (RNN-T), and the Recurrent Neural Aligner (RNA) build on this monotonic alignment and use locally encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross-attention biasing technique for the Transformer that takes the monotonic alignment between text output and speech input into account by making use of the cross-attention weights. Specifically, a Gaussian mask is applied to the cross-attention weights to restrict the input speech context to a local range around the alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
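The core mechanism in the abstract, biasing cross-attention weights with a Gaussian mask centered on an alignment position, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the way per-token alignment centers are supplied, and the window width `sigma` are all hypothetical choices for the sketch.

```python
import numpy as np

def gaussian_masked_attention(scores, centers, sigma=3.0):
    """Bias cross-attention logits toward a monotonic alignment.

    scores:  (T_out, T_in) raw attention logits (decoder query x encoder key)
    centers: (T_out,) estimated input-frame position for each output token
             (the alignment information; hypothetical input for this sketch)
    sigma:   Gaussian window width controlling how local the context is
    Returns normalized attention weights of shape (T_out, T_in).
    """
    T_out, T_in = scores.shape
    positions = np.arange(T_in)
    # Log-domain Gaussian penalty: grows quadratically with the distance
    # between each input position and the alignment center of the token.
    mask = -((positions[None, :] - centers[:, None]) ** 2) / (2.0 * sigma**2)
    biased = scores + mask
    # Softmax over the input (encoder) axis, numerically stabilized.
    e = np.exp(biased - biased.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because the mask is added in the log domain before the softmax, positions far from the alignment center receive exponentially smaller weight, which is one straightforward way to realize the local-context restriction the abstract describes.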
Pages: 5031-5035
Page count: 5
Related Papers
50 items in total
  • [21] Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
    Kim, Jaehyeon
    Kim, Sungwon
    Kong, Jungil
    Yoon, Sungroh
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [22] TRANSFORMER-BASED TEXT-TO-SPEECH WITH WEIGHTED FORCED ATTENTION
    Okamoto, Takuma
    Toda, Tomoki
    Shiga, Yoshinori
    Kawai, Hisashi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6729 - 6733
  • [23] Time domain speech enhancement with CNN and time-attention transformer
    Saleem, Nasir
    Gunawan, Teddy Surya
    Dhahbi, Sami
    Bourouis, Sami
    DIGITAL SIGNAL PROCESSING, 2024, 147
  • [24] EEG-Transformer: Self-attention from Transformer Architecture for Decoding EEG of Imagined Speech
    Lee, Young-Eun
    Lee, Seo-Hyun
    10TH INTERNATIONAL WINTER CONFERENCE ON BRAIN-COMPUTER INTERFACE (BCI2022), 2022,
  • [25] WaveNet With Cross-Attention for Audiovisual Speech Recognition
    Wang, Hui
    Gao, Fei
    Zhao, Yue
    Wu, Licheng
    IEEE ACCESS, 2020, 8 : 169160 - 169168
  • [26] Few Shot Medical Image Segmentation with Cross Attention Transformer
    Lin, Yi
    Chen, Yufan
    Cheng, Kwang-Ting
    Chen, Hao
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT II, 2023, 14221 : 233 - 243
  • [27] Deformable Cross-Attention Transformer for Medical Image Registration
    Chen, Junyu
    Liu, Yihao
    He, Yufan
    Du, Yong
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I, 2024, 14348 : 115 - 125
  • [28] Integration of morphological features and contextual weightage using monotonic chunk attention for part of speech tagging
    Mundotiya, Rajesh Kumar
    Mehta, Arpit
    Baruah, Rupjyoti
    Singh, Anil Kumar
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2022, 34 (09) : 7324 - 7334
  • [29] Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
    Zhou, Xiao
    Ling, Zhenhua
    Hu, Yajun
    Dai, Lirong
    APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [30] EXPLICIT ALIGNMENT OF TEXT AND SPEECH ENCODINGS FOR ATTENTION-BASED END-TO-END SPEECH RECOGNITION
    Drexler, Jennifer
    Glass, James
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 913 - 919