TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 28
Authors
Kano, Takatomo [1]
Sakti, Sakriani [1, 2]
Nakamura, Satoshi [1, 2]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma, Japan
[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning; MODELS;
DOI
10.1109/SLT48900.2021.9383496
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional speech translation systems use a cascade approach that chains automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another step by step. Unfortunately, since those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted to construct direct speech translation in a single model. That model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, that work was evaluated only on Spanish-English, a language pair with similar syntax and word order. For syntactically distant language pairs, speech translation requires long-distance word reordering, and thus direct frame-to-frame speech alignment becomes difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process. However, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we propose a step-by-step scheme for complete end-to-end speech-to-speech translation and a Transformer-based speech translation model using a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.
Pages: 958 - 965
Number of pages: 8
Related papers
50 in total
  • [41] NESTING HIERARCHICAL PHRASE-BASED MODEL FOR SPEECH-TO-SPEECH TRANSLATION
    Fu, Xiaoyin
    Wei, Wei
    Fan, Lichun
    Lu, Shixiang
    Xu, Bo
    2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 368 - 372
  • [42] Class-based Statistical Machine Translation for Field Maintainable Speech-To-Speech Translation
    Lane, Ian R.
    Waibel, Alex
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2362 - 2365
  • [43] Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
    Jia, Ye
    Ramanovich, Michelle Tadmor
    Remez, Tal
    Pomerantz, Roi
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022, : 10120 - 10134
  • [44] Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
    Jia, Ye
    Ding, Yifan
    Bapna, Ankur
    Cherry, Colin
    Zhang, Yu
    Conneau, Alexis
    Morioka, Nobuyuki
    INTERSPEECH 2022, 2022, : 1721 - 1725
  • [45] A transformer-based network for speech recognition
    Tang L.
    International Journal of Speech Technology, 2023, 26 (02) : 531 - 539
  • [46] Speech-to-speech translation software on PDAs for travel conversation
    Isotani, Ryosuke
    Yamabana, Kiyoshi
    Ando, Shinichi
    Hanazawa, Ken
    Ishikawa, Shin-Ya
    Iso, Ken-Ichi
    NEC Research and Development, 2003, 44 (SPEC.): 197 - 202
  • [47] CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
    Jia, Ye
    Ramanovich, Michelle Tadmor
    Wang, Quan
    Zen, Heiga
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6691 - 6703
  • [48] A hand-held speech-to-speech translation system
    Zhou, BW
    Gao, YQ
    Sorensen, J
    Déchelotte, D
    Picheny, M
    ASRU'03: 2003 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING ASRU '03, 2003, : 664 - 669
  • [49] SPEECH-TO-SPEECH TRANSLATION BETWEEN UNTRANSCRIBED UNKNOWN LANGUAGES
    Tjandra, Andros
    Sakti, Sakriani
    Nakamura, Satoshi
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 593 - 600
  • [50] Predicting dialogue acts for a speech-to-speech translation system
    Reithinger, N
    Engel, R
    Kipp, M
    Klesen, M
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 654 - 657