TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 28
Authors
Kano, Takatomo [1 ]
Sakti, Sakriani [1 ,2 ]
Nakamura, Satoshi [1 ,2 ]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma, Japan
[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021
Keywords
speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning; MODELS;
DOI
10.1109/SLT48900.2021.9383496
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Traditional speech translation systems use a cascaded approach that concatenates automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another step by step. Unfortunately, since those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted direct speech translation with a single model. That model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, that work was evaluated only on Spanish-English, a language pair with similar syntax and word order. For syntactically distant language pairs, speech translation requires long-distance word reordering, so direct speech frame-to-frame alignment becomes difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process; however, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we bring a step-by-step scheme to complete end-to-end speech-to-speech translation and propose a Transformer-based speech translation model using a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.
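The multi-task scheme described in the abstract trains one model to predict target spectrograms while auxiliary decoders predict source and target phoneme transcriptions. A minimal sketch of how such auxiliary losses are commonly combined into a single training objective (the function name and weights below are illustrative assumptions, not values from the paper):

```python
def multitask_loss(spec_loss, src_phone_loss, tgt_phone_loss,
                   w_spec=1.0, w_src=0.1, w_tgt=0.1):
    """Weighted sum of the primary spectrogram-prediction loss and the
    two auxiliary phoneme-transcription losses (weights are assumed)."""
    return (w_spec * spec_loss
            + w_src * src_phone_loss
            + w_tgt * tgt_phone_loss)
```

With small auxiliary weights, the phoneme tasks regularize the shared encoder without dominating the primary spectrogram objective.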
Pages: 958-965 (8 pages)