UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Cited by: 0

Authors:

Inaguma, Hirofumi [1]

Popuri, Sravya [1]

Kulikov, Ilia [1]

Chen, Peng-Jen [1]

Wang, Changhan [1]

Chung, Yu-An [1]

Tang, Yun [1]

Lee, Ann [1]

Watanabe, Shinji [2]

Pino, Juan [1]

Affiliations:

[1] Meta AI, FAIR, New York, NY 10017 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1 | 2023

Keywords:

TRANSFORMER; MODELS;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.


Pages: 15655-15680

Page count: 26

