Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

Cited by: 0
Authors
Zheng, Guolin [1 ]
Xiao, Yubei [1 ]
Gong, Ke [2 ]
Zhou, Pan [3 ]
Liang, Xiaodan [1 ]
Lin, Liang [1 ,2 ]
Affiliations
[1] Sun Yat-sen University, Guangzhou, China
[2] DarkMatter AI Research, London, England
[3] Sea AI Lab, Singapore
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Unifying acoustic and linguistic representation learning has become increasingly crucial for transferring knowledge learned from abundant high-resource language data to low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to resolve the representation discrepancy between speech and text remains unexplored, which hinders the full utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with acoustic features, which may cause catastrophic forgetting. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method that fuses and utilizes the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representations, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT; together they effectively facilitate the cooperation of the two pre-trained models and thus boost representation learning. Extensive experiments show that Wav-BERT significantly outperforms existing approaches and achieves state-of-the-art performance on low-resource speech recognition.
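
The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea: cascading a pre-trained wav2vec 2.0 acoustic encoder and a BERT linguistic encoder, then fusing their hidden states with a cross-attention step loosely in the spirit of the Representation Aggregation Module. The checkpoint names, the single cross-attention fusion step, and the dimensions are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class WavBertSketch(nn.Module):
    """Hypothetical fusion of wav2vec 2.0 and BERT hidden states (not the paper's exact design)."""

    def __init__(self, hidden_size=768, num_heads=8):
        super().__init__()
        # Pre-trained acoustic and linguistic encoders; both base variants emit 768-d states.
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # Illustrative aggregation: each text token attends over the acoustic frame sequence.
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_values, input_ids, attention_mask=None):
        # (batch, audio_frames, 768) frame-level acoustic representations.
        acoustic_states = self.acoustic(input_values).last_hidden_state
        # (batch, text_tokens, 768) token-level linguistic representations.
        linguistic_states = self.linguistic(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Cross-attention fusion followed by a residual connection and layer normalization.
        fused, _ = self.cross_attn(
            query=linguistic_states, key=acoustic_states, value=acoustic_states
        )
        return self.norm(linguistic_states + fused)

Because the base configurations of both encoders share a 768-dimensional hidden size, no projection layer is needed in this sketch; in practice the fused representation would feed a downstream decoding head (e.g., CTC or token classification) to produce transcriptions.
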
Pages: 2765 - 2777 (13 pages)
Related Papers
50 items in total
  • [1] Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition
    Yi, Cheng
    Zhou, Shiyu
    Xu, Bo
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 788 - 792
  • [2] Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview
    Yu, Chongchong
    Kang, Meng
    Chen, Yunbing
    Wu, Jiajia
    Zhao, Xia
    IEEE ACCESS, 2020, 8 : 163829 - 163843
  • [3] MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition
    Xie, Jiamin
    Hansen, John H. L.
    INTERSPEECH 2023, 2023, : 1304 - 1308
  • [4] Multilingual acoustic models for speech recognition in low-resource devices
    Garcia, Enrique Gil
    Mengusoglu, Erhan
    Janke, Eric
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 981 - +
  • [5] Acoustic Modeling for Hindi Speech Recognition in Low-Resource Settings
    Dey, Anik
    Zhang, Weibin
    Fung, Pascale
    2014 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING (ICALIP), VOLS 1-2, 2014, : 891 - 894
  • [6] Transfer Ability of Monolingual Wav2vec2.0 for Low-resource Speech Recognition
    Yi, Cheng
    Wang, Jianzong
    Cheng, Ning
    Zhou, Shiyu
    Xu, Bo
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [7] Low-resource Sinhala Speech Recognition using Deep Learning
    Karunathilaka, Hirunika
    Welgama, Viraj
    Nadungodage, Thilini
    Weerasinghe, Ruvan
    2020 20TH INTERNATIONAL CONFERENCE ON ADVANCES IN ICT FOR EMERGING REGIONS (ICTER-2020), 2020, : 196 - 201
  • [8] Meta adversarial learning improves low-resource speech recognition
    Chen, Yaqi
    Yang, Xukui
    Zhang, Hao
    Zhang, Wenlin
    Qu, Dan
    Chen, Cong
    COMPUTER SPEECH AND LANGUAGE, 2024, 84
  • [9] Meta-Learning for Low-Resource Speech Emotion Recognition
    Chopra, Suransh
    Mathur, Puneet
    Sawhney, Ramit
    Shah, Rajiv Ratn
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6259 - 6263
  • [10] Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition
    Feng, Siyuan
    Tu, Ming
    Xia, Rui
    Huang, Chuanzeng
    Wang, Yuxuan
    INTERSPEECH 2023, 2023, : 1384 - 1388