TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 28
Authors
Kano, Takatomo [1 ]
Sakti, Sakriani [1 ,2 ]
Nakamura, Satoshi [1 ,2 ]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma, Japan
[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning; MODELS;
DOI
10.1109/SLT48900.2021.9383496
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional speech translation systems use a cascaded approach that concatenates automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another in a step-by-step manner. Unfortunately, because those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted to construct direct speech translation in a single model. That model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, it was evaluated only on Spanish-English language pairs, which share similar syntax and word order. For syntactically distant language pairs, translation requires long-distance word reordering, so direct speech frame-to-frame alignment becomes difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process; however, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we bring the step-by-step scheme into a complete end-to-end speech-to-speech translation model and propose a Transformer-based speech translation model with a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.
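
The abstract describes keeping the ASR, MT, and TTS stages inside one end-to-end trainable Transformer stack, with a Transcoder consuming the ASR decoder's hidden states and multitask losses on source transcriptions, target transcriptions, and target spectrograms. The PyTorch sketch below only illustrates that wiring; it is not the authors' implementation, and all module names, layer sizes, mask handling, and loss weighting are illustrative assumptions.

# Minimal sketch (not the authors' code) of a step-by-step speech-to-speech
# translation stack trained end-to-end: speech encoder -> ASR decoder ->
# Transcoder (attends to ASR decoder states) -> TTS decoder.
import torch
import torch.nn as nn

D_MODEL = 256      # assumed model width
N_MELS = 80        # assumed mel-spectrogram dimension
SRC_VOCAB = 1000   # assumed source phoneme/character vocabulary size
TGT_VOCAB = 1000   # assumed target phoneme/character vocabulary size


class SketchTranscoderS2ST(nn.Module):
    def __init__(self):
        super().__init__()
        # Speech encoder: projects mel frames, then Transformer encoder layers.
        self.speech_proj = nn.Linear(N_MELS, D_MODEL)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        # ASR decoder: predicts source-language transcriptions (auxiliary task).
        self.src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        self.asr_out = nn.Linear(D_MODEL, SRC_VOCAB)
        # Transcoder: attends to ASR decoder hidden states instead of raw speech
        # and predicts target-language transcriptions, keeping the step-by-step
        # structure inside a single trainable network.
        self.tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transcoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        self.mt_out = nn.Linear(D_MODEL, TGT_VOCAB)
        # TTS decoder: attends to Transcoder hidden states and predicts mel frames.
        self.mel_prenet = nn.Linear(N_MELS, D_MODEL)
        self.tts_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=4)
        self.mel_out = nn.Linear(D_MODEL, N_MELS)

    def forward(self, src_mels, src_tokens, tgt_tokens, tgt_mels):
        # Teacher-forced training pass; causal masks are omitted for brevity.
        enc = self.speech_encoder(self.speech_proj(src_mels))
        asr_h = self.asr_decoder(self.src_embed(src_tokens), enc)
        trans_h = self.transcoder(self.tgt_embed(tgt_tokens), asr_h)
        tts_h = self.tts_decoder(self.mel_prenet(tgt_mels), trans_h)
        return self.asr_out(asr_h), self.mt_out(trans_h), self.mel_out(tts_h)


def multitask_loss(asr_logits, mt_logits, mel_pred, src_tokens, tgt_tokens, tgt_mels):
    # Illustrative unweighted sum of ASR, translation, and spectrogram losses.
    ce = nn.CrossEntropyLoss()
    return (ce(asr_logits.transpose(1, 2), src_tokens)
            + ce(mt_logits.transpose(1, 2), tgt_tokens)
            + nn.functional.l1_loss(mel_pred, tgt_mels))


if __name__ == "__main__":
    # Shape check with random data: batch of 2 utterances.
    model = SketchTranscoderS2ST()
    src_mels = torch.randn(2, 120, N_MELS)
    tgt_mels = torch.randn(2, 140, N_MELS)
    src_tokens = torch.randint(0, SRC_VOCAB, (2, 30))
    tgt_tokens = torch.randint(0, TGT_VOCAB, (2, 35))
    asr_logits, mt_logits, mel_pred = model(src_mels, src_tokens, tgt_tokens, tgt_mels)
    print(multitask_loss(asr_logits, mt_logits, mel_pred,
                         src_tokens, tgt_tokens, tgt_mels).item())

Because every stage only attends to the previous stage's hidden states, the whole pipeline remains differentiable end-to-end while still exposing intermediate transcription targets for multitask supervision, which is the property the abstract contrasts with purely direct frame-to-frame models.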
Pages: 958 - 965
Page count: 8
Related Papers
50 items in total
  • [21] Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation
    Dong, Qianqian
    Yue, Fengpeng
    Ko, Tom
    Wang, Mingxuan
    Bai, Qibing
    Zhang, Yu
    INTERSPEECH 2022, 2022, : 1781 - 1785
  • [22] Multilingual speech-to-speech translation system: VoiceTra
    Matsuda, Shigeki
    Hu, Xinhui
    Shiga, Yoshinori
    Kashioka, Hideki
    Hori, Chiori
    Yasuda, Keiji
    Okuma, Hideo
    Uchiyama, Masao
    Sumita, Eiichiro
    Kawai, Hisashi
    Nakamura, Satoshi
    2013 IEEE 14TH INTERNATIONAL CONFERENCE ON MOBILE DATA MANAGEMENT (MDM 2013), VOL 2, 2013, : 229 - 233
  • [23] Research opportunities in automatic speech-to-speech translation
    Stüker, S.
    2012, Institute of Electrical and Electronics Engineers Inc. (31)
  • [24] From Speech-to-Speech Translation to Automatic Dubbing
    Federico, Marcello
    Enyedi, Robert
    Barra-Chicote, Roberto
    Giri, Ritwik
    Isik, Umut
    Krishnaswamy, Arvindh
    Sawaf, Hassan
    17TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION (IWSLT 2020), 2020, : 257 - 264
  • [25] Pattern recognition approaches for speech-to-speech translation
    Casacuberta, F
    Vidal, E
    Sanchis, A
    Vilar, JM
    CYBERNETICS AND SYSTEMS, 2004, 35 (01) : 3 - 17
  • [26] UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
    Inaguma, Hirofumi
    Popuri, Sravya
    Kulikov, Ilia
    Chen, Peng-Jen
    Wang, Changhan
    Chung, Yu-An
    Tang, Yun
    Lee, Ann
    Watanabe, Shinji
    Pino, Juan
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15655 - 15680
  • [27] Finite-state speech-to-speech translation
    Vidal, E
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 111 - 114
  • [28] ASSESSING EVALUATION METRICS FOR SPEECH-TO-SPEECH TRANSLATION
    Salesky, Elizabeth
    Maeder, Julian
    Klinger, Severin
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 733 - 740
  • [29] Incremental Dialog Clustering For Speech-to-Speech Translation
    Stallard, David
    Tsakalidis, Stavros
    Saleem, Shirin
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 428 - 431
  • [30] Semantic transfer in speech-to-speech machine translation
    Abb, B
    Buschbeck-Wolf, B
    Tschernitschek, C
    NATURAL LANGUAGE PROCESSING AND SPEECH TECHNOLOGY: RESULTS OF THE 3RD KONVENS CONFERENCE, 1996, : 123 - 136