UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Authors
Inaguma, Hirofumi [1]
Popuri, Sravya [1]
Kulikov, Ilia [1]
Chen, Peng-Jen [1]
Wang, Changhan [1]
Chung, Yu-An [1]
Tang, Yun [1]
Lee, Ann [1]
Watanabe, Shinji [2]
Pino, Juan [1]
Affiliations
[1] Meta AI, FAIR, New York, NY 10017 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
TRANSFORMER; MODELS;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
Pages: 15655-15680
Page count: 26