END-TO-END SPEECH SUMMARIZATION USING RESTRICTED SELF-ATTENTION

Cited by: 8
Authors
Sharma, Roshan [1 ]
Palaskar, Shruti [1 ]
Black, Alan W. [1 ]
Metze, Florian [1 ]
Affiliation
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
speech summarization; end-to-end; long sequence modeling; concept learning
DOI
10.1109/ICASSP43922.2022.9747320
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech summarization is typically performed using a cascade of speech recognition and text summarization models. End-to-end modeling of speech summarization is challenging due to memory and compute constraints arising from long input audio sequences. Recent work in document summarization has inspired methods that reduce the complexity of self-attention, enabling transformer models to handle long sequences. In this work, we introduce a single model optimized end-to-end for speech summarization. We apply the restricted self-attention technique from text-based models to speech models to address the memory and compute constraints. We demonstrate that the proposed model learns to directly summarize speech on the How2 corpus of instructional videos. The proposed end-to-end model outperforms the previously proposed cascaded model by 3 points absolute on ROUGE. Further, we consider the spoken language understanding task of predicting concepts from speech inputs and show that the proposed end-to-end model outperforms the cascade model by 4 points absolute F1.
Pages: 8072-8076
Number of pages: 5
Related Papers
50 records in total
  • [1] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399
  • [2] Efficient decoding self-attention for end-to-end speech synthesis
    Zhao, Wei
    Xu, Li
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07) : 1127 - 1138
  • [3] On the localness modeling for the self-attention based end-to-end speech synthesis
    Yang, Shan
    Lu, Heng
    Kang, Shiyin
    Xue, Liumeng
    Xiao, Jinba
    Su, Dan
    Xie, Lei
    Yu, Dong
    NEURAL NETWORKS, 2020, 125 : 121 - 130
  • [4] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [5] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    APPLIED SCIENCES-BASEL, 2019, 9 (21):
  • [6] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [7] End-to-End ASR with Adaptive Span Self-Attention
    Chang, Xuankai
    Subramanian, Aswin Shanmugam
    Guo, Pengcheng
    Watanabe, Shinji
    Fujita, Yuya
    Omachi, Motoi
    INTERSPEECH 2020, 2020, : 3595 - 3599
  • [8] END-TO-END NEURAL SPEAKER DIARIZATION WITH SELF-ATTENTION
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Xue, Yawen
    Nagamatsu, Kenji
    Watanabe, Shinji
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 296 - 303
  • [9] Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition
    Li, Lujun
    Kang, Yikai
    Shi, Yuchen
    Kurzinger, Ludwig
    Watzel, Tobias
    Rigoll, Gerhard
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)