TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 28

Authors:

Kano, Takatomo ^{[1]}

Sakti, Sakriani ^{[1,2]}

Nakamura, Satoshi ^{[1,2]}

Affiliations:

[1] Nara Inst Sci & Technol, Ikoma, Japan

[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning; MODELS;

DOI:

10.1109/SLT48900.2021.9383496

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Traditional speech translation systems use a cascade that chains automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another step by step. Unfortunately, because these components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted to construct direct speech translation in a single model. The model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, that work was evaluated only on the Spanish-English language pair, whose languages have similar syntax and word order. With syntactically distant language pairs, speech translation requires long-distance word reordering, and thus direct frame-to-frame speech alignment becomes difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process. However, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we extend the step-by-step scheme to complete end-to-end speech-to-speech translation and propose Transformer-based speech translation using a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.

Citation

Pages: 958-965

Page count: 8
