ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Authors
Le, Chenyang [1, 4]
Qian, Yao [2]
Zhou, Long [3]
Liu, Shujie [3]
Qian, Yanmin [1]
Zeng, Michael [2]
Huang, Xuedong [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Microsoft Cloud & AI, Redmond, WA USA
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Microsoft, Redmond, WA USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Joint speech-language training is challenging because of its large demand for training data and GPU resources, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. In particular, we propose to incorporate cross-modality learning into transfer learning and conduct both simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
Pages: 12