Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; HIDDEN MARKOV-MODELS;
DOI
10.21437/Interspeech.2020-1198
CLC classification codes
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification codes
100104 ; 100213 ;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for various sequence-to-sequence transformation tasks. This architecture disperses its attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. However, automatic speech recognition (ASR) is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and the Recurrent Neural Aligner (RNA) build on this monotonic alignment and use locally encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross-attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into account by making use of the cross-attention weights. Specifically, a Gaussian mask is applied to the cross-attention weights to limit the input speech context to a local range, given alignment information. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
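The core idea of the abstract, biasing cross-attention with a Gaussian mask centered on an alignment position, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the way alignment centers are supplied, and the `sigma` width are all illustrative assumptions, and the mask is applied in the log domain so that the softmax still normalizes over the input frames.

```python
import numpy as np

def gaussian_masked_cross_attention(scores, centers, sigma=3.0):
    """Illustrative Gaussian biasing of cross-attention scores.

    scores:  (T_out, T_in) raw attention scores (pre-softmax)
    centers: (T_out,) assumed alignment position (encoder frame index)
             for each output token
    sigma:   Gaussian window width (hypothetical hyperparameter)
    """
    T_out, T_in = scores.shape
    positions = np.arange(T_in)  # encoder frame indices 0..T_in-1
    # Gaussian mask peaking at each token's alignment center, shape (T_out, T_in)
    mask = np.exp(-((positions[None, :] - centers[:, None]) ** 2)
                  / (2.0 * sigma ** 2))
    # Add log-mask to the scores: distant frames are strongly down-weighted
    masked = scores + np.log(mask + 1e-9)
    # Standard softmax over the input (frame) axis
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

With uniform (zero) scores, each output token's attention distribution still sums to one but concentrates around its alignment center, which is the local-context effect the paper attributes to the mask.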
Pages: 5031 - 5035
Page count: 5
Related papers
50 records in total
  • [1] Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
    Zhao, Chendong
    Wang, Jianzong
    Wei, Wenqi
    Qu, Xiaoyang
    Wang, Haoqian
    Xiao, Jing
    2022 IEEE 9TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA), 2022, : 173 - 180
  • [2] MONOTONIC SEGMENTAL ATTENTION FOR AUTOMATIC SPEECH RECOGNITION
    Zeyer, Albert
    Schmitt, Robin
    Zhou, Wei
    Schlueter, Ralf
    Ney, Hermann
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 229 - 236
  • [3] Monotonic Gaussian regularization of attention for robust automatic speech recognition
    Du, Yeqian
    Wu, Minghui
    Fang, Xin
    Yang, Zhouwang
    COMPUTER SPEECH AND LANGUAGE, 2023, 77
  • [4] CASwin Transformer: A Hierarchical Cross Attention Transformer for Depth Completion
    Feng, Chunyu
    Wang, Xiaonian
    Zhang, Yangyang
    Zhao, Chengfeng
    Song, Mengxuan
    2022 IEEE 25TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), 2022, : 2836 - 2841
  • [5] A window attention based Transformer for Automatic Speech Recognition
    Feng, Zhao
    Li, Yongming
    2024 5TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND APPLICATION, ICCEA 2024, 2024, : 449 - 454
  • [6] RAT: RNN-Attention Transformer for Speech Enhancement
    Zhang, Tailong
    He, Shulin
    Li, Hao
    Zhang, Xueliang
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 463 - 467
  • [7] Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
    Neekhara, Paarth
    Hussain, Shehzeen
    Ghosh, Subhankar
    Li, Jason
    Ginsburg, Boris
    INTERSPEECH 2024, 2024, : 3425 - 3429
  • [8] Cross-Attention Transformer for Video Interpolation
    Kim, Hannah Halin
    Yu, Shuzhi
    Yuan, Shuai
    Tomasi, Carlo
    COMPUTER VISION - ACCV 2022 WORKSHOPS, 2023, 13848 : 325 - 342
  • [9] Cross on Cross Attention: Deep Fusion Transformer for Image Captioning
    Zhang, Jing
    Xie, Yingshuai
    Ding, Weichao
    Wang, Zhe
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4257 - 4268
  • [10] LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition
    Fu, Pengbin
    Liu, Daxing
    Yang, Huirong
    INFORMATION, 2022, 13 (05)