Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

被引:0
|
作者
Zheng, Guolin [1 ]
Xiao, Yubei [1 ]
Gong, Ke [2 ]
Zhou, Pan [3 ]
Liang, Xiaodan [1 ]
Lin, Liang [1 ,2 ]
机构
[1] Sun Yat Sen Univ, Guangzhou, Peoples R China
[2] Dark Matter AI Res, London, England
[3] Sea AI Lab, Linz, Austria
基金
国家重点研发计划; 中国国家自然科学基金;
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representation, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT, which can effectively facilitate the cooperation of two pre-trained models and thus boost the representation learning. Extensive experiments show that our Wav-BERT significantly outperforms the existing approaches and achieves state-of-the-art performance on low-resource speech recognition.
引用
收藏
页码:2765 / 2777
页数:13
相关论文
共 50 条
  • [31] Low-Resource Speech Recognition and Keyword-Spotting
    Gales, Mark J. F.
    Knill, Kate M.
    Ragni, Anton
    SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 3 - 19
  • [32] ON SCALING CONTRASTIVE REPRESENTATIONS FOR LOW-RESOURCE SPEECH RECOGNITION
    Borgholt, Lasse
    Tax, Tycho M. S.
    Havtorn, Jakob D.
    Maaloe, Lars
    Igel, Christian
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3885 - 3889
  • [33] IFF-WAV2VEC: Noise Robust Low-Resource Speech Recognition Based on Self-supervised Learning and Interactive Feature Fusion
    Cao, Jing
    Qian, Zhaopeng
    Yu, Chongchong
    Xie, Tao
    PROCEEDINGS OF 2023 6TH ARTIFICIAL INTELLIGENCE AND CLOUD COMPUTING CONFERENCE, AICCC 2023, 2023, : 232 - 237
  • [34] Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments
    Zhang, Xu
    Zhang, Xiangcheng
    Chen, Weisi
    Li, Chenlong
    Yu, Chengyuan
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [35] BrainTalker: Low-Resource Brain-to-Speech Synthesis with Transfer Learning using Wav2Vec 2.0
    Kim, Miseul
    Piao, Zhenyu
    Lee, Jihyun
    Kang, Hong-Goo
    2023 IEEE EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS, BHI, 2023,
  • [36] Transfer Learning from Multi-Lingual Speech Translation Benefits Low-Resource Speech Recognition
    Vanderreydt, Geoffroy
    Remy, Francois
    Demuynck, Kris
    INTERSPEECH 2022, 2022, : 3053 - 3057
  • [37] Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges
    Lin, Hung-Pang
    Zhang, Yu-Jia
    Chen, Chia-Ping
    INTERSPEECH 2021, 2021, : 4339 - 4343
  • [38] JOINT ACOUSTIC MODELING OF TRIPHONES AND TRIGRAPHEMES BY MULTI-TASK LEARNING DEEP NEURAL NETWORKS FOR LOW-RESOURCE SPEECH RECOGNITION
    Chen, Dongpeng
    Mak, Brian
    Leung, Cheung-Chi
    Sivadas, Sunil
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [39] Convolutional Maxout Neural Networks for Low-Resource Speech Recognition
    Cai, Meng
    Shi, Yongzhe
    Kang, Jian
    Liu, Jia
    Su, Tengrong
    2014 9TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2014, : 133 - +
  • [40] MIXSPEECH: DATA AUGMENTATION FOR LOW-RESOURCE AUTOMATIC SPEECH RECOGNITION
    Meng, Linghui
    Xu, Jin
    Tan, Xu
    Wang, Jindong
    Qin, Tao
    Xu, Bo
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7008 - 7012