BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric

被引:0
|
作者
Chen, Mingda [1 ]
Duquenne, Paul-Ambroise [1 ]
Andrews, Pierre [1 ]
Kao, Justine [1 ]
Mourachko, Alexandre [1 ]
Schwenk, Holger [1 ]
Costa-Jussa, Marta R. [1 ]
机构
[1] Meta AI, Menlo Pk, CA 94025 USA
来源
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1 | 2023年
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASRdependent metrics including ASR-SENTBLEU in all translation directions and ASR- COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.(1)
引用
收藏
页码:9064 / 9079
页数:16
相关论文
共 50 条
  • [1] ASSESSING EVALUATION METRICS FOR SPEECH-TO-SPEECH TRANSLATION
    Salesky, Elizabeth
    Maeder, Julian
    Klinger, Severin
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 733 - 740
  • [2] Unsupervised features from text for speech synthesis in a speech-to-speech translation system
    Watts, Oliver
    Zhou, Bowen
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2164 - 2167
  • [3] Statistical Vowelization of Arabic Text for Speech Synthesis in Speech-to-Speech Translation Systems
    Gu, Liang
    Zhang, Wei
    Tahir, Lazkin
    Gao, Yuqing
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 509 - 512
  • [4] Generating Arabic text in multilingual speech-to-speech machine translation framework
    Monem, Azza Abdel
    Shaalan, Khaled
    Rafea, Ahmed
    Baraka, Hoda
    MACHINE TRANSLATION, 2008, 22 (04) : 205 - 258
  • [5] Impacts of machine translation and speech synthesis on speech-to-speech translation
    Hashimoto, Kei
    Yamagishi, Junichi
    Byrne, William
    King, Simon
    Tokuda, Keiichi
    SPEECH COMMUNICATION, 2012, 54 (07) : 857 - 866
  • [6] Hierarchical Classification for Speech-to-Speech Translation
    Ettelaie, Emil
    Georgiou, Panayiotis G.
    Narayanan, Shrikanth S.
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2534 - 2537
  • [7] The NESPOLE! speech-to-speech translation system
    Lavie, A
    Levin, L
    Frederking, R
    Pianesi, F
    MACHINE TRANSLATION: FROM RESEARCH TO REAL USERS, 2002, 2499 : 240 - 243
  • [8] Towards Machine Speech-to-speech Translation
    Satoshi, Nakamura
    Sudoh, Katsuhito
    Sakti, Sakriani
    TRADUMATICA-TRADUCCIO I TECNOLOGIES DE LA INFORMACIO I LA COMUNICACIO, 2019, (17): : 81 - 87
  • [9] Prosody generation for speech-to-speech translation
    Aguero, Pablo Daniel
    Adell, Jordi
    Bonafonte, Antonio
    2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 557 - 560
  • [10] Contextual reasoning in speech-to-speech translation
    Koch, S
    Küssner, U
    Stede, M
    Tidhar, D
    NATURAL LANGUAGE PROCESSING-NLP 2000, PROCEEDINGS, 2000, 1835 : 283 - 292