Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

Cited by: 0
Authors
Zheng, Guolin [1]
Xiao, Yubei [1]
Gong, Ke [2]
Zhou, Pan [3]
Liang, Xiaodan [1]
Lin, Liang [1, 2]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou, Peoples R China
[2] Dark Matter AI Res, London, England
[3] Sea AI Lab, Linz, Austria
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representation, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT, which can effectively facilitate the cooperation of two pre-trained models and thus boost the representation learning. Extensive experiments show that our Wav-BERT significantly outperforms the existing approaches and achieves state-of-the-art performance on low-resource speech recognition.
Pages: 2765-2777
Number of pages: 13