Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; HIDDEN MARKOV-MODELS;
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for various sequence-to-sequence transformation tasks. The architecture disperses its attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. However, automatic speech recognition (ASR) is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and Recurrent Neural Aligner (RNA) build on this monotonic alignment and use local encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into consideration by making use of the cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to limit the input speech context to a local range around the alignment. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that the proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
Pages: 5031-5035
Page count: 5
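
The abstract describes the core mechanism: a Gaussian mask applied to the cross attention weights so that each output token attends only to a local region of the encoded speech around its alignment position. Below is a minimal PyTorch sketch of that idea; the function name gaussian_biased_cross_attention, the per-token alignment centers, and the width sigma are illustrative assumptions rather than the paper's exact formulation, and the alignment regularizer mentioned in the abstract is not shown.

import math
import torch
import torch.nn.functional as F

def gaussian_biased_cross_attention(query, key, value, centers, sigma=3.0):
    # query:   (batch, tgt_len, d_model)  decoder states
    # key:     (batch, src_len, d_model)  encoded speech states
    # value:   (batch, src_len, d_model)
    # centers: (batch, tgt_len)           assumed source position per target token
    d_model = query.size(-1)
    # Standard scaled dot-product attention logits.
    logits = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_model)
    # Gaussian bias in log space: source positions far from the alignment center
    # are penalized, which concentrates attention on a local speech context.
    src_pos = torch.arange(key.size(1), device=key.device, dtype=query.dtype)
    dist = src_pos.view(1, 1, -1) - centers.unsqueeze(-1)   # (batch, tgt_len, src_len)
    bias = -(dist ** 2) / (2.0 * sigma ** 2)
    weights = F.softmax(logits + bias, dim=-1)
    return torch.matmul(weights, value), weights

A caller would obtain centers from an external or jointly learned alignment estimate, passing decoder states as query and encoder speech representations as key and value; smaller sigma values restrict attention to a tighter window around the alignment.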