INJECTING TEXT IN SELF-SUPERVISED SPEECH PRETRAINING

Cited by: 10
Authors
Chen, Zhehuai [1 ]
Zhang, Yu [1 ]
Rosenberg, Andrew [1 ]
Ramabhadran, Bhuvana [1 ]
Wang, Gary [1 ]
Moreno, Pedro [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Keywords
Speech Recognition; Speech Synthesis; Self-supervised; Representation learning; RECOGNITION;
DOI
10.1109/ASRU51503.2021.9688018
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining. We demonstrate that this novel pretraining method yields Word Error Rate (WER) reductions of 10% relative on the well-benchmarked LibriSpeech task over a state-of-the-art baseline pretrained with wav2vec2.0 only. The proposed method also serves as an effective strategy to compensate for a lack of transcribed speech, matching the performance of 5000 hours of transcribed speech with just 100 hours of transcribed speech on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% on an in-house Voice Search task over traditional pretraining. Incorporating text into encoder pretraining is complementary to rescoring with a larger or in-domain language model, resulting in an additional 6% relative reduction in WER.
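The abstract describes coupling a contrastive self-supervised loss with an additional sequence (lexical) loss during pretraining. The sketch below illustrates that kind of coupling as a weighted sum; the function names, the InfoNCE-style contrastive term, the NLL-style sequence term, and the weight `alpha` are all illustrative assumptions, not the paper's actual implementation.

```python
import math

def contrastive_loss(similarity_pos: float, similarities_all: list,
                     temperature: float = 0.1) -> float:
    """InfoNCE-style contrastive term: negative log-softmax of the
    positive pair's similarity against all candidate similarities."""
    logits = [s / temperature for s in similarities_all]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(similarity_pos / temperature - log_denom)

def sequence_loss(token_log_probs: list) -> float:
    """Per-token negative log-likelihood of the target token sequence;
    stands in for the lexical sequence loss on synthesized speech."""
    return -sum(token_log_probs) / len(token_log_probs)

def joint_pretrain_loss(similarity_pos: float, similarities_all: list,
                        token_log_probs: list, alpha: float = 1.0) -> float:
    """Couple the two objectives as a weighted sum; alpha (assumed) trades
    off acoustic contrastive learning against lexical learning."""
    return (contrastive_loss(similarity_pos, similarities_all)
            + alpha * sequence_loss(token_log_probs))
```

In an actual system both terms would be computed from encoder outputs over masked speech frames (real or TTS-synthesized); here scalar similarities and log-probabilities keep the coupling itself visible.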
Pages: 251-258 (8 pages)
Related Papers
(50 in total)
  • [31] Self-Supervised Pretraining for Cardiovascular Magnetic Resonance Cine Segmentation
    de Mooi, Rob A. J.
    Pluim, Iris O. W.
    Scannell, Cian M.
    DATA ENGINEERING IN MEDICAL IMAGING, DEMI 2024, 2025, 15265 : 115 - 124
  • [32] Does Self-Supervised Pretraining Really Match ImageNet Weights?
    Pototzky, Daniel
    Sultan, Azhar
    Schmidt-Thieme, Lars
    2022 IEEE 14TH IMAGE, VIDEO, AND MULTIDIMENSIONAL SIGNAL PROCESSING WORKSHOP (IVMSP), 2022,
  • [33] DIFFERENCING BASED SELF-SUPERVISED PRETRAINING FOR SCENE CHANGE DETECTION
    Ramkumar, Vijaya Raghavan T.
    Arani, Elahe
    Zonooz, Bahram
    CONFERENCE ON LIFELONG LEARNING AGENTS, VOL 199, 2022, 199
  • [34] Self-Supervised Pretraining With Monocular Height Estimation for Semantic Segmentation
    Xiong, Zhitong
    Chen, Sining
    Shi, Yilei
    Zhu, Xiao Xiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [35] A METHOD FOR ROOF WIREFRAME RECONSTRUCTION BASED ON SELF-SUPERVISED PRETRAINING
    Yang, Hongxin
    Huang, Shangfeng
    Wang, Ruisheng
    ISPRS ANNALS OF THE PHOTOGRAMMETRY, REMOTE SENSING AND SPATIAL INFORMATION SCIENCES: VOLUME X-2-2024, 2024, : 239 - 246
  • [36] HYBRID TRANSFORMER NETWORK FOR CHANGE DETECTION UNDER SELF-SUPERVISED PRETRAINING
    Cui, Yongjing
    Zhuang, Yin
    Dong, Shan
    Zhang, Xinyi
    Gao, Peng
    Chen, He
    Chen, Liang
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 6652 - 6655
  • [37] Weakly-Guided Self-Supervised Pretraining for Temporal Activity Detection
    Kahatapitiya, Kumara
    Ren, Zhou
    Li, Haoxiang
    Wu, Zhenyu
    Ryoo, Michael S.
    Hua, Gang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 1078 - 1086
  • [38] SELF-SUPERVISED AUDIO ENCODER WITH CONTRASTIVE PRETRAINING FOR RESPIRATORY ANOMALY DETECTION
    Kulkarni, Shubham
    Watanabe, Hideaki
    Homma, Fuminori
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [39] Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining
    Chao, Qin
    Kim, Eunsoo
    Li, Boyang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1535 - 1540
  • [40] Self-Supervised Pretraining of Transformers for Satellite Image Time Series Classification
    Yuan, Yuan
    Lin, Lei
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2021, 14 : 474 - 487