UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Cited by: 0
Authors
Inaguma, Hirofumi [1]
Popuri, Sravya [1]
Kulikov, Ilia [1]
Chen, Peng-Jen [1]
Wang, Changhan [1]
Chung, Yu-An [1]
Tang, Yun [1]
Lee, Ann [1]
Watanabe, Shinji [2]
Pino, Juan [1]
Affiliations
[1] Meta AI, FAIR, New York, NY 10017 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
TRANSFORMER; MODELS;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
Pages: 15655 - 15680
Page count: 26
Related papers
50 in total
  • [1] Direct Speech-to-Speech Translation With Discrete Units
    Lee, Ann
    Chen, Peng-Jen
    Wang, Changhan
    Gu, Jiatao
    Popuri, Sravya
    Ma, Xutai
    Polyak, Adam
    Adi, Yossi
    He, Qing
    Tang, Yun
    Pino, Juan
    Hsu, Wei-Ning
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 3327 - 3339
  • [2] Tibetan-Chinese speech-to-speech translation based on discrete units
    Gong, Zairan
    Xu, Xiaona
    Zhao, Yue
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [3] TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER
    Kano, Takatomo
    Sakti, Sakriani
    Nakamura, Satoshi
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 958 - 965
  • [4] Direct speech-to-speech translation with a sequence-to-sequence model
    Jia, Ye
    Weiss, Ron J.
    Biadsy, Fadi
    Macherey, Wolfgang
    Johnson, Melvin
    Chen, Zhifeng
    Wu, Yonghui
    INTERSPEECH 2019, 2019, : 1123 - 1127
  • [5] Direct Vs Cascaded Speech-to-Speech Translation Using Transformer
    Arya, Lalaram
    Chowdhury, Amartya Roy
    Prasanna, S. R. Mahadeva
    SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 258 - 270
  • [6] Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
    Wang, Yongqi
    Bai, Jionghao
    Huang, Rongjie
    Li, Ruiqi
    Hong, Zhiqing
    Zhao, Zhou
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 4: STUDENT RESEARCH WORKSHOP, 2024, : 34 - 41
  • [7] Impacts of machine translation and speech synthesis on speech-to-speech translation
    Hashimoto, Kei
    Yamagishi, Junichi
    Byrne, William
    King, Simon
    Tokuda, Keiichi
    SPEECH COMMUNICATION, 2012, 54 (07) : 857 - 866
  • [8] Hierarchical Classification for Speech-to-Speech Translation
    Ettelaie, Emil
    Georgiou, Panayiotis G.
    Narayanan, Shrikanth S.
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2534 - 2537
  • [9] The NESPOLE! speech-to-speech translation system
    Lavie, A
    Levin, L
    Frederking, R
    Pianesi, F
    MACHINE TRANSLATION: FROM RESEARCH TO REAL USERS, 2002, 2499 : 240 - 243
  • [10] Towards Machine Speech-to-speech Translation
    Satoshi, Nakamura
    Sudoh, Katsuhito
    Sakti, Sakriani
    TRADUMATICA-TRADUCCIO I TECNOLOGIES DE LA INFORMACIO I LA COMUNICACIO, 2019, (17) : 81 - 87