TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 28
Authors
Kano, Takatomo [1 ]
Sakti, Sakriani [1 ,2 ]
Nakamura, Satoshi [1 ,2 ]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma, Japan
[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning; MODELS;
DOI
10.1109/SLT48900.2021.9383496
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional speech translation systems use a cascaded approach that concatenates automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another in a step-by-step manner. Unfortunately, because those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted to construct direct speech translation in a single model. That model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, it was evaluated only on Spanish-English language pairs, which share similar syntax and word order. For syntactically distant language pairs, translation requires long-distance word reordering, so direct speech frame-to-frame alignment becomes difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process; however, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we bring the step-by-step scheme into a complete end-to-end speech-to-speech translation model and propose a Transformer-based speech translation model with a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.
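
The abstract describes keeping the ASR, MT, and TTS stages inside one end-to-end trainable Transformer stack, with a Transcoder consuming the ASR decoder's hidden states and multitask losses on source transcriptions, target transcriptions, and target spectrograms. The PyTorch sketch below only illustrates that wiring; it is not the authors' implementation, and all module names, layer sizes, mask handling, and loss weighting are illustrative assumptions.

# Minimal sketch (not the authors' code) of a step-by-step speech-to-speech
# translation stack trained end-to-end: speech encoder -> ASR decoder ->
# Transcoder (attends to ASR decoder states) -> TTS decoder.
import torch
import torch.nn as nn

D_MODEL = 256      # assumed model width
N_MELS = 80        # assumed mel-spectrogram dimension
SRC_VOCAB = 1000   # assumed source phoneme/character vocabulary size
TGT_VOCAB = 1000   # assumed target phoneme/character vocabulary size


class SketchTranscoderS2ST(nn.Module):
    def __init__(self):
        super().__init__()
        # Speech encoder: projects mel frames, then Transformer encoder layers.
        self.speech_proj = nn.Linear(N_MELS, D_MODEL)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        # ASR decoder: predicts source-language transcriptions (auxiliary task).
        self.src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        self.asr_out = nn.Linear(D_MODEL, SRC_VOCAB)
        # Transcoder: attends to ASR decoder hidden states instead of raw speech
        # and predicts target-language transcriptions, keeping the step-by-step
        # structure inside a single trainable network.
        self.tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transcoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        self.mt_out = nn.Linear(D_MODEL, TGT_VOCAB)
        # TTS decoder: attends to Transcoder hidden states and predicts mel frames.
        self.mel_prenet = nn.Linear(N_MELS, D_MODEL)
        self.tts_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        self.mel_out = nn.Linear(D_MODEL, N_MELS)

    def forward(self, src_mels, src_tokens, tgt_tokens, tgt_mels):
        # Teacher-forced training pass; causal masks are omitted for brevity.
        enc = self.speech_encoder(self.speech_proj(src_mels))
        asr_h = self.asr_decoder(self.src_embed(src_tokens), enc)
        trans_h = self.transcoder(self.tgt_embed(tgt_tokens), asr_h)
        tts_h = self.tts_decoder(self.mel_prenet(tgt_mels), trans_h)
        return self.asr_out(asr_h), self.mt_out(trans_h), self.mel_out(tts_h)


def multitask_loss(asr_logits, mt_logits, mel_pred, src_tokens, tgt_tokens, tgt_mels):
    # Illustrative unweighted sum of ASR, translation, and spectrogram losses.
    ce = nn.CrossEntropyLoss()
    return (ce(asr_logits.transpose(1, 2), src_tokens)
            + ce(mt_logits.transpose(1, 2), tgt_tokens)
            + nn.functional.l1_loss(mel_pred, tgt_mels))


if __name__ == "__main__":
    # Shape check with random data: batch of 2 utterances.
    model = SketchTranscoderS2ST()
    src_mels = torch.randn(2, 120, N_MELS)
    tgt_mels = torch.randn(2, 140, N_MELS)
    src_tokens = torch.randint(0, SRC_VOCAB, (2, 30))
    tgt_tokens = torch.randint(0, TGT_VOCAB, (2, 35))
    asr_logits, mt_logits, mel_pred = model(src_mels, src_tokens, tgt_tokens, tgt_mels)
    print(multitask_loss(asr_logits, mt_logits, mel_pred,
                         src_tokens, tgt_tokens, tgt_mels).item())

Because every stage only attends to the previous stage's hidden states, the whole pipeline remains differentiable end-to-end while still exposing intermediate transcription targets for multitask supervision, which is the property the abstract contrasts with purely direct frame-to-frame models.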
Pages: 958 - 965
Page count: 8
Related Papers
50 items in total
  • [21] Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation
    Dong, Qianqian
    Yue, Fengpeng
    Ko, Tom
    Wang, Mingxuan
    Bai, Qibing
    Zhang, Yu
    INTERSPEECH 2022, 2022, : 1781 - 1785
  • [22] Multilingual speech-to-speech translation system: VoiceTra
    Matsuda, Shigeki
    Hu, Xinhui
    Shiga, Yoshinori
    Kashioka, Hideki
    Hori, Chiori
    Yasuda, Keiji
    Okuma, Hideo
    Uchiyama, Masao
    Sumita, Eiichiro
    Kawai, Hisashi
    Nakamura, Satoshi
    2013 IEEE 14TH INTERNATIONAL CONFERENCE ON MOBILE DATA MANAGEMENT (MDM 2013), VOL 2, 2013, : 229 - 233
  • [23] Research opportunities in automatic speech-to-speech translation
    Stüker, S.
    2012, Institute of Electrical and Electronics Engineers Inc. (31)
  • [24] From Speech-to-Speech Translation to Automatic Dubbing
    Federico, Marcello
    Enyedi, Robert
    Barra-Chicote, Roberto
    Giri, Ritwik
    Isik, Umut
    Krishnaswamy, Arvindh
    Sawaf, Hassan
    17TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION (IWSLT 2020), 2020, : 257 - 264
  • [25] Pattern recognition approaches for speech-to-speech translation
    Casacuberta, F
    Vidal, E
    Sanchis, A
    Vilar, JM
    CYBERNETICS AND SYSTEMS, 2004, 35 (01) : 3 - 17
  • [26] UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
    Inaguma, Hirofumi
    Popuri, Sravya
    Kulikov, Ilia
    Chen, Peng-Jen
    Wang, Changhan
    Chung, Yu-An
    Tang, Yun
    Lee, Ann
    Watanabe, Shinji
    Pino, Juan
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15655 - 15680
  • [27] Finite-state speech-to-speech translation
    Vidal, E
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 111 - 114
  • [28] ASSESSING EVALUATION METRICS FOR SPEECH-TO-SPEECH TRANSLATION
    Salesky, Elizabeth
    Maeder, Julian
    Klinger, Severin
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 733 - 740
  • [29] Incremental Dialog Clustering For Speech-to-Speech Translation
    Stallard, David
    Tsakalidis, Stavros
    Saleem, Shirin
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 428 - 431
  • [30] Semantic transfer in speech-to-speech machine translation
    Abb, B
    Buschbeck-Wolf, B
    Tschernitschek, C
    NATURAL LANGUAGE PROCESSING AND SPEECH TECHNOLOGY: RESULTS OF THE 3RD KONVENS CONFERENCE, 1996, : 123 - 136