UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Cited by: 0

Authors:

Inaguma, Hirofumi [1]

Popuri, Sravya [1]

Kulikov, Ilia [1]

Chen, Peng-Jen [1]

Wang, Changhan [1]

Chung, Yu-An [1]

Tang, Yun [1]

Lee, Ann [1]

Watanabe, Shinji [2]

Pino, Juan [1]

Affiliations:

[1] Meta AI, FAIR, New York, NY 10017 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1 | 2023

Keywords:

TRANSFORMER; MODELS;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
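
The two-pass flow described in the abstract (text first, discrete acoustic units second) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`two_pass_decode`, `toy_encode`, `toy_decode_text`, `toy_decode_units`) is hypothetical, the toy components are placeholders for trained networks, and the choice to feed first-pass decoder states into the second pass is an assumption made for illustration, since the abstract does not spell out how the two passes are connected.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]  # hypothetical stand-in for a hidden-state vector


@dataclass
class TwoPassOutput:
    subwords: List[str]  # first-pass output: textual representation
    units: List[int]     # second-pass output: discrete acoustic units


def two_pass_decode(
    speech_frames: List[Vector],
    encode: Callable[[List[Vector]], List[Vector]],
    decode_text: Callable[[List[Vector]], Tuple[List[str], List[Vector]]],
    decode_units: Callable[[List[Vector]], List[int]],
) -> TwoPassOutput:
    """Run the two passes in order: speech -> subwords -> discrete units."""
    encoder_states = encode(speech_frames)
    # First pass: predict subwords and keep the decoder states.
    subwords, text_states = decode_text(encoder_states)
    # Second pass (assumption): predict units from the first-pass states.
    units = decode_units(text_states)
    return TwoPassOutput(subwords=subwords, units=units)


# Toy placeholders so the sketch runs end to end; real systems use neural models.
def toy_encode(frames: List[Vector]) -> List[Vector]:
    # One averaged value per input frame.
    return [[sum(frame) / len(frame)] for frame in frames]


def toy_decode_text(states: List[Vector]) -> Tuple[List[str], List[Vector]]:
    # One fake subword and one fake decoder state per encoder state.
    subwords = [f"tok{i}" for i in range(len(states))]
    text_states = [[float(i)] for i in range(len(states))]
    return subwords, text_states


def toy_decode_units(text_states: List[Vector]) -> List[int]:
    # Two fake unit IDs per first-pass state, since speech units
    # typically outnumber subwords.
    units: List[int] = []
    for state in text_states:
        base = int(state[0])
        units.extend([base, base + 1])
    return units


result = two_pass_decode(
    [[0.0, 1.0], [2.0, 3.0]], toy_encode, toy_decode_text, toy_decode_units
)
print(result.subwords, result.units)
```

In a real system the unit sequence would then be passed to a separate unit-to-waveform vocoder, which this sketch omits.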

Pages: 15655-15680

Page count: 26
