END-TO-END SPEECH SUMMARIZATION USING RESTRICTED SELF-ATTENTION

Cited by: 8
Authors
Sharma, Roshan [1 ]
Palaskar, Shruti [1 ]
Black, Alan W. [1 ]
Metze, Florian [1 ]
Affiliation
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
speech summarization; end-to-end; long sequence modeling; concept learning
DOI
10.1109/ICASSP43922.2022.9747320
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech summarization is typically performed using a cascade of speech recognition and text summarization models. End-to-end modeling of speech summarization is challenging due to memory and compute constraints arising from long input audio sequences. Recent work in document summarization has inspired methods that reduce the complexity of self-attention, enabling transformer models to handle long sequences. In this work, we introduce a single model optimized end-to-end for speech summarization. We apply the restricted self-attention technique from text-based models to speech models to address the memory and compute constraints. We demonstrate that the proposed model learns to directly summarize speech on the How2 corpus of instructional videos. The proposed end-to-end model outperforms the previously proposed cascaded model by 3 points absolute on ROUGE. Further, we consider the spoken language understanding task of predicting concepts from speech inputs and show that the proposed end-to-end model outperforms the cascade model by 4 points absolute F1.
Pages: 8072-8076
Number of pages: 5
Related Papers
50 records in total
  • [1] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399
  • [2] Efficient decoding self-attention for end-to-end speech synthesis
    Zhao, Wei
    Xu, Li
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07) : 1127 - 1138
  • [3] On the localness modeling for the self-attention based end-to-end speech synthesis
    Yang, Shan
    Lu, Heng
    Kang, Shiyin
    Xue, Liumeng
    Xiao, Jinba
    Su, Dan
    Xie, Lei
    Yu, Dong
    NEURAL NETWORKS, 2020, 125 : 121 - 130
  • [4] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [5] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    APPLIED SCIENCES-BASEL, 2019, 9 (21):
  • [6] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [7] End-to-End ASR with Adaptive Span Self-Attention
    Chang, Xuankai
    Subramanian, Aswin Shanmugam
    Guo, Pengcheng
    Watanabe, Shinji
    Fujita, Yuya
    Omachi, Motoi
    INTERSPEECH 2020, 2020, : 3595 - 3599
  • [8] END-TO-END NEURAL SPEAKER DIARIZATION WITH SELF-ATTENTION
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Xue, Yawen
    Nagamatsu, Kenji
    Watanabe, Shinji
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 296 - 303
  • [9] Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition
    Li, Lujun
    Kang, Yikai
    Shi, Yuchen
    Kurzinger, Ludwig
    Watzel, Tobias
    Rigoll, Gerhard
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)