Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models;
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for a variety of sequence-to-sequence transformation tasks. The architecture disperses its attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. Automatic speech recognition (ASR), however, is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), the RNN Transducer (RNN-T), and the Recurrent Neural Aligner (RNA) build on this monotonic alignment and use locally encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into account by making use of the cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range around the given alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that the proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
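To make the biasing idea concrete, below is a minimal sketch (PyTorch, not the authors' released implementation) of Gaussian-masked cross attention. It assumes the Gaussian mask is applied as an additive log-domain bias on the pre-softmax attention scores, which is equivalent to multiplying the post-softmax weights by a Gaussian window; the names "centers" (per-token alignment positions in encoder frames) and "sigma" (window width) are illustrative, and the paper's exact parameterization and alignment regularizer are not reproduced here.

import torch
import torch.nn.functional as F

def gaussian_masked_cross_attention(query, key, value, centers, sigma=10.0):
    # query:   (batch, tgt_len, d)  decoder states
    # key:     (batch, src_len, d)  encoder (speech) representations
    # value:   (batch, src_len, d)
    # centers: (batch, tgt_len)     assumed alignment position (encoder frame index)
    #                               for each output token -- hypothetical input here
    # sigma:   width of the Gaussian window, i.e. how local the attention is
    d = query.size(-1)

    # Standard scaled dot-product attention scores: (batch, tgt_len, src_len)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d ** 0.5

    # Squared distance of every encoder frame from the alignment center.
    src_pos = torch.arange(key.size(1), device=key.device, dtype=query.dtype)
    dist = src_pos.view(1, 1, -1) - centers.unsqueeze(-1)  # (batch, tgt_len, src_len)

    # Adding -(j - center)^2 / (2 * sigma^2) before the softmax is the same as
    # multiplying the attention weights by a Gaussian mask centred on `centers`.
    weights = F.softmax(scores - dist.pow(2) / (2.0 * sigma ** 2), dim=-1)
    return torch.matmul(weights, value), weights

# Toy usage with a hypothetical linear (strictly monotonic) alignment guess.
batch, tgt_len, src_len, d = 2, 5, 40, 64
query = torch.randn(batch, tgt_len, d)
key = torch.randn(batch, src_len, d)
value = torch.randn(batch, src_len, d)
centers = torch.linspace(0, src_len - 1, tgt_len).expand(batch, tgt_len)
context, weights = gaussian_masked_cross_attention(query, key, value, centers)
print(context.shape, weights.shape)  # torch.Size([2, 5, 64]) torch.Size([2, 5, 40])

A smaller sigma concentrates each output token's attention on a narrower span of encoder frames, enforcing the locality that the monotonic speech-text alignment suggests, while a larger sigma approaches unmasked cross attention.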
Pages: 5031-5035
Number of pages: 5
Related Papers
50 records in total
  • [41] Yin, Hua; Chen, Qitong; Chen, Liang; Shen, Changqing. Cross-Attention Transformer-Based Domain Adaptation: A Novel Method for Fault Diagnosis of Rotating Machinery With High Generalizability and Alignment Capability. IEEE SENSORS JOURNAL, 2024, 24 (23): 40049-40058
  • [42] Miao, Chenfeng; Zou, Kun; Zhuang, Ziyang; Wei, Tao; Ma, Jun; Wang, Shaojun; Xiao, Jing. Towards Efficiently Learning Monotonic Alignments for Attention-Based End-to-End Speech Recognition. INTERSPEECH 2022, 2022: 1051-1055
  • [43] Wang, Xiyu; Guo, Pengxin; Zhang, Yu. Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT V, 2023, 14173: 309-325
  • [44] Deng, Anping; Han, Guangliang; Zhang, Zhongbo; Chen, Dianbing; Ma, Tianjiao; Liu, Zhichao. Cross-Parallel Attention and Efficient Match Transformer for Aerial Tracking. REMOTE SENSING, 2024, 16 (06)
  • [45] Zhou, Zhuozhi; Lan, Jinhui. A Dual Cross Attention Transformer Network for Infrared and Visible Image Fusion. 2024 7TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA, ICAIBD 2024, 2024: 494-499
  • [46] Duan, Yueqi; Sun, Haowen; Yan, Juncheng; Lu, Jiwen; Zhou, Jie. Learning Cross-Attention Point Transformer With Global Porous Sampling. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 6283-6297
  • [47] Zhou, Yuan; Huo, Chunlei; Zhu, Jiahang; Huo, Leigang; Pan, Chunhong. DCAT: Dual Cross-Attention-Based Transformer for Change Detection. REMOTE SENSING, 2023, 15 (09)
  • [48] Wang, Wenxiao; Yao, Lu; Chen, Long; Lin, Binbin; Cai, Deng; He, Xiaofei; Liu, Wei. CROSSFORMER: A VERSATILE VISION TRANSFORMER HINGING ON CROSS-SCALE ATTENTION. ICLR 2022 - 10th International Conference on Learning Representations, 2022
  • [49] Song, Zijie; Hu, Zhenzhen; Zhou, Yuanen; Zhao, Ye; Hong, Richang; Wang, Meng. Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 9008-9020
  • [50] Song, Jiechong; Mou, Chong; Wang, Shiqi; Ma, Siwei; Zhang, Jian. Optimization-Inspired Cross-Attention Transformer for Compressive Sensing. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 6174-6184