Cross Attention with Monotonic Alignment for Speech Transformer

Cited by: 4
Authors
Zhao, Yingzhu [1 ,2 ,3 ]
Ni, Chongjia [2 ]
Leung, Cheung-Chi [2 ]
Joty, Shafiq [1 ]
Chng, Eng Siong [1 ]
Ma, Bin [2 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[3] Joint PhD Program Alibaba & Nanyang Technol Univ, Singapore, Singapore
Source
INTERSPEECH 2020
Keywords
speech recognition; end-to-end; transformer; alignment; cross attention; hidden Markov models;
DOI
10.21437/Interspeech.2020-1198
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
Transformer, a state-of-the-art neural network architecture, has been used successfully for a variety of sequence-to-sequence transformation tasks. The architecture disperses its attention distribution over the entire input to learn long-term dependencies, which is important for tasks such as neural machine translation and text summarization. Automatic speech recognition (ASR), however, is characterized by a monotonic alignment between the text output and the speech input. Techniques such as Connectionist Temporal Classification (CTC), the RNN Transducer (RNN-T), and the Recurrent Neural Aligner (RNA) build on this monotonic alignment and use locally encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross attention biasing technique for the transformer that takes the monotonic alignment between text output and speech input into account by making use of the cross attention weights. Specifically, a Gaussian mask is applied to the cross attention weights to restrict the input speech context to a local range around the given alignment position. We further introduce a regularizer for alignment regularization. Experiments on the LibriSpeech dataset show that the proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative word error rate (WER) reductions.
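To make the biasing idea concrete, below is a minimal sketch (PyTorch, not the authors' released implementation) of Gaussian-masked cross attention. It assumes the Gaussian mask is applied as an additive log-domain bias on the pre-softmax attention scores, which is equivalent to multiplying the post-softmax weights by a Gaussian window; the names "centers" (per-token alignment positions in encoder frames) and "sigma" (window width) are illustrative, and the paper's exact parameterization and alignment regularizer are not reproduced here.

import torch
import torch.nn.functional as F

def gaussian_masked_cross_attention(query, key, value, centers, sigma=10.0):
    # query:   (batch, tgt_len, d)  decoder states
    # key:     (batch, src_len, d)  encoder (speech) representations
    # value:   (batch, src_len, d)
    # centers: (batch, tgt_len)     assumed alignment position (encoder frame index)
    #                               for each output token -- hypothetical input here
    # sigma:   width of the Gaussian window, i.e. how local the attention is
    d = query.size(-1)

    # Standard scaled dot-product attention scores: (batch, tgt_len, src_len)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d ** 0.5

    # Squared distance of every encoder frame from the alignment center.
    src_pos = torch.arange(key.size(1), device=key.device, dtype=query.dtype)
    dist = src_pos.view(1, 1, -1) - centers.unsqueeze(-1)  # (batch, tgt_len, src_len)

    # Adding -(j - center)^2 / (2 * sigma^2) before the softmax is the same as
    # multiplying the attention weights by a Gaussian mask centred on `centers`.
    weights = F.softmax(scores - dist.pow(2) / (2.0 * sigma ** 2), dim=-1)
    return torch.matmul(weights, value), weights

# Toy usage with a hypothetical linear (strictly monotonic) alignment guess.
batch, tgt_len, src_len, d = 2, 5, 40, 64
query = torch.randn(batch, tgt_len, d)
key = torch.randn(batch, src_len, d)
value = torch.randn(batch, src_len, d)
centers = torch.linspace(0, src_len - 1, tgt_len).expand(batch, tgt_len)
context, weights = gaussian_masked_cross_attention(query, key, value, centers)
print(context.shape, weights.shape)  # torch.Size([2, 5, 64]) torch.Size([2, 5, 40])

A smaller sigma concentrates each output token's attention on a narrower span of encoder frames, enforcing the locality that the monotonic speech-text alignment suggests, while a larger sigma approaches unmasked cross attention.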
Pages: 5031-5035
Number of pages: 5
Related Papers
50 records in total
  • [41] Yin, Hua; Chen, Qitong; Chen, Liang; Shen, Changqing. Cross-Attention Transformer-Based Domain Adaptation: A Novel Method for Fault Diagnosis of Rotating Machinery With High Generalizability and Alignment Capability. IEEE SENSORS JOURNAL, 2024, 24 (23): 40049-40058
  • [42] Miao, Chenfeng; Zou, Kun; Zhuang, Ziyang; Wei, Tao; Ma, Jun; Wang, Shaojun; Xiao, Jing. Towards Efficiently Learning Monotonic Alignments for Attention-Based End-to-End Speech Recognition. INTERSPEECH 2022, 2022: 1051-1055
  • [43] Wang, Xiyu; Guo, Pengxin; Zhang, Yu. Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT V, 2023, 14173: 309-325
  • [44] Deng, Anping; Han, Guangliang; Zhang, Zhongbo; Chen, Dianbing; Ma, Tianjiao; Liu, Zhichao. Cross-Parallel Attention and Efficient Match Transformer for Aerial Tracking. REMOTE SENSING, 2024, 16 (06)
  • [45] Zhou, Zhuozhi; Lan, Jinhui. A Dual Cross Attention Transformer Network for Infrared and Visible Image Fusion. 2024 7TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA, ICAIBD 2024, 2024: 494-499
  • [46] Duan, Yueqi; Sun, Haowen; Yan, Juncheng; Lu, Jiwen; Zhou, Jie. Learning Cross-Attention Point Transformer With Global Porous Sampling. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 6283-6297
  • [47] Zhou, Yuan; Huo, Chunlei; Zhu, Jiahang; Huo, Leigang; Pan, Chunhong. DCAT: Dual Cross-Attention-Based Transformer for Change Detection. REMOTE SENSING, 2023, 15 (09)
  • [48] Wang, Wenxiao; Yao, Lu; Chen, Long; Lin, Binbin; Cai, Deng; He, Xiaofei; Liu, Wei. CROSSFORMER: A VERSATILE VISION TRANSFORMER HINGING ON CROSS-SCALE ATTENTION. ICLR 2022 - 10th International Conference on Learning Representations, 2022
  • [49] Song, Zijie; Hu, Zhenzhen; Zhou, Yuanen; Zhao, Ye; Hong, Richang; Wang, Meng. Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 9008-9020
  • [50] Song, Jiechong; Mou, Chong; Wang, Shiqi; Ma, Siwei; Zhang, Jian. Optimization-Inspired Cross-Attention Transformer for Compressive Sensing. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 6174-6184