SimulSpeech: End-to-End Simultaneous Speech to Text Translation

Cited by: 0
Authors
Ren, Yi [1]
Liu, Jinglin [1]
Tan, Xu [2]
Zhang, Chen [1]
Qin, Tao [2]
Zhao, Zhou [1]
Liu, Tie-Yan [2]
Affiliations
[1] Zhejiang University, Hangzhou, Zhejiang, People's Republic of China
[2] Microsoft Research, Redmond, WA, USA
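Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; Natural Science Foundation of Zhejiang Province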
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we develop SimulSpeech, an end-to-end simultaneous speech-to-text translation system that translates speech in a source language into text in a target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, and 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. SimulSpeech is more challenging than previous cascaded systems (with simultaneous automatic speech recognition (ASR) and simultaneous neural machine translation (NMT)). We introduce two novel knowledge distillation methods to ensure its performance: 1) attention-level knowledge distillation transfers the knowledge from the multiplication of the attention matrices of simultaneous NMT and ASR models to help the training of the attention mechanism in SimulSpeech; 2) data-level knowledge distillation transfers the knowledge from the full-sentence NMT model and also reduces the complexity of the data distribution to help the optimization of SimulSpeech. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech-to-text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of BLEU scores and translation delay.
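Two of the mechanisms named in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names (`wait_k_mask`, `attention_product`) are hypothetical, the wait-k rule follows the common formulation in which the i-th target token (0-based) may attend to the first i + k source segments, and the multiplication order (NMT attention times ASR attention) is an assumption chosen so that the matrix dimensions line up.

```python
def wait_k_mask(num_target, num_source, k):
    """Return mask[i][j] = True when target step i (0-based) may attend to
    source segment j under a wait-k policy (assumed formulation: step i sees
    the first min(i + k, num_source) segments)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [
        [j < min(i + k, num_source) for j in range(num_source)]
        for i in range(num_target)
    ]


def attention_product(nmt_attn, asr_attn):
    """Multiply an NMT attention matrix (target tokens x source tokens) by an
    ASR attention matrix (source tokens x speech frames). The result has shape
    target tokens x speech frames and could serve as a teacher signal for a
    speech-to-text attention matrix (assumed usage, not confirmed here)."""
    inner = len(asr_attn)
    if any(len(row) != inner for row in nmt_attn):
        raise ValueError("inner dimensions do not match")
    frames = len(asr_attn[0])
    return [
        [sum(row[s] * asr_attn[s][f] for s in range(inner)) for f in range(frames)]
        for row in nmt_attn
    ]


# Toy example: 3 target tokens, 4 source segments, k = 2.
mask = wait_k_mask(3, 4, 2)

# Toy attention matrices whose rows each sum to 1.
nmt = [[1.0, 0.0], [0.5, 0.5]]            # 2 target tokens x 2 source tokens
asr = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # 2 source tokens x 3 speech frames
teacher = attention_product(nmt, asr)     # 2 target tokens x 3 speech frames
```

Because each row of both inputs sums to 1, each row of the product also sums to 1, so it can be read as a distribution over speech frames for every target token.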
Pages: 3787-3796
Number of pages: 10
Related papers
50 in total
  • [21] End-to-end Speech-to-Punctuated-Text Recognition
    Nozaki, Jumon
    Kawahara, Tatsuya
    Ishizuka, Kenkichi
    Hashimoto, Taiichi
    INTERSPEECH 2022, 2022, : 1811 - 1815
  • [22] End-to-End Mongolian Text-to-Speech System
    Li, Jingdong
    Zhang, Hui
    Liu, Rui
    Zhang, Xueliang
    Bao, Feilong
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 483 - 487
  • [23] End-to-End Speech Synthesis for Bangla with Text Normalization
    Pial, Tanzir Islam
    Aunti, Shahreen Salim
    Ahmed, Shabbir
    Heickal, Hasnain
    2018 5TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE/ INTELLIGENCE AND APPLIED INFORMATICS (CSII 2018), 2018, : 66 - 71
  • [24] Improving End-to-End Text Image Translation From the Auxiliary Text Translation Task
    Ma, Cong
    Zhang, Yaping
    Tu, Mei
    Han, Xu
    Wu, Linghui
    Zhao, Yang
    Zhou, Yu
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1664 - 1670
  • [25] Adaptive Feature Selection for End-to-End Speech Translation
    Zhang, Biao
    Titov, Ivan
    Haddow, Barry
    Sennrich, Rico
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 2533 - 2544
  • [26] MINTZAI: End-to-end Deep Learning for Speech Translation
    Etchegoyhen, Thierry
    Arzelus, Haritz
    Gete, Harritxu
    Alvarez, Aitor
    Hernaez, Inma
    Navas, Eva
    Gonzalez-Docasal, Ander
    Osacar, Jaime
    Benites, Edson
    Ellakuria, Igor
    Calonge, Eusebi
    Martin, Maite
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2020, (65): : 97 - 100
  • [27] Speaker voice normalization for end-to-end speech translation
    Xue, Zhengshan
    Shi, Tingxun
    Zhang, Xiaolei
    Xiong, Deyi
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 248
  • [28] Self-Training for End-to-End Speech Translation
    Pino, Juan
    Xu, Qiantong
    Ma, Xutai
    Dousti, Mohammad Javad
    Tang, Yun
    INTERSPEECH 2020, 2020, : 1476 - 1480
  • [29] Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation
    Dong, Qianqian
    Ye, Rong
    Wang, Mingxuan
    Zhou, Hao
    Xu, Shuang
    Xu, Bo
    Li, Lei
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 12749 - 12759
  • [30] AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation
    Huang, Wuwei
    Wang, Dexin
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2539 - 2545