TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER

Cited by: 28
Authors
Kano, Takatomo [1 ]
Sakti, Sakriani [1 ,2 ]
Nakamura, Satoshi [1 ,2 ]
Affiliations
[1] Nara Inst Sci & Technol, Ikoma, Japan
[2] RIKEN, Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021
Keywords
speech-to-speech translation; Transcoder; Transformer; sequence-to-sequence model; multitask learning; MODELS;
DOI
10.1109/SLT48900.2021.9383496
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Traditional speech translation systems use a cascaded approach that concatenates automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another step by step. Unfortunately, since those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translations. Recently, one work attempted direct speech translation with a single model. That model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcriptions as auxiliary tasks. However, that work was evaluated only on Spanish-English, a language pair with similar syntax and word order. For syntactically distant language pairs, speech translation requires long-distance word reordering, so direct speech frame-to-frame alignment becomes difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process; however, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on recurrent neural network (RNN) models. In this work, we bring a step-by-step scheme to complete end-to-end speech-to-speech translation and propose a Transformer-based speech translation model using a Transcoder. We compare our proposed model with the multi-task model on syntactically similar and distant language pairs.
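The multi-task scheme described in the abstract trains one model to predict target spectrograms while auxiliary decoders predict source and target phoneme transcriptions. A minimal sketch of how such auxiliary losses are commonly combined into a single training objective (the function name and weights below are illustrative assumptions, not values from the paper):

```python
def multitask_loss(spec_loss, src_phone_loss, tgt_phone_loss,
                   w_spec=1.0, w_src=0.1, w_tgt=0.1):
    """Weighted sum of the primary spectrogram-prediction loss and the
    two auxiliary phoneme-transcription losses (weights are assumed)."""
    return (w_spec * spec_loss
            + w_src * src_phone_loss
            + w_tgt * tgt_phone_loss)
```

With small auxiliary weights, the phoneme tasks regularize the shared encoder without dominating the primary spectrogram objective.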
Pages: 958-965 (8 pages)