An improved wav2vec 2.0 pre-training approach using enhanced local dependency modeling for speech recognition

Cited by: 2
Authors
Zhu, Qiu-shi [1 ]
Zhang, Jie [1 ]
Wu, Ming-hui [2 ]
Fang, Xin [1 ,2 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] iFlytek Co Ltd, iFlytek Res, Hefei, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Speech recognition; pre-training; wav2vec 2.0; transformer; low-resource; local and global dependence; TRANSFORMER;
DOI
10.21437/Interspeech.2021-67
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification
100104 ; 100213 ;
Abstract
Wav2vec 2.0 is a recently proposed self-supervised pre-training framework for learning speech representations. It uses a transformer to learn global contextual representations, which is especially effective in low-resource scenarios. It has also been shown that combining convolutional neural networks and transformers to model both local and global dependencies is beneficial for tasks such as automatic speech recognition (ASR) and natural language processing (NLP). However, how to model local and global dependencies in pre-training models remains an open question in the speech domain. In this paper, we therefore propose a new transformer encoder that enhances local dependency modeling by combining convolution and self-attention modules. The encoder first places the convolution and self-attention modules in parallel, then follows them with another convolution module in series, with the whole structure sandwiched between a pair of feed-forward modules. Experimental results show that a model pre-trained with the proposed method reduces the word error rate (WER) compared to our reproduced wav2vec 2.0, at the cost of a slight increase in the number of model parameters.
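The block structure described in the abstract (a half of a feed-forward module, then convolution and self-attention in parallel, then a serial convolution module, then the closing feed-forward half) can be sketched roughly as below. This is a minimal illustrative NumPy sketch, not the paper's implementation: all dimensions, weight initializations, the single attention head, and the absence of layer normalization and gating are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8    # model dimension (illustrative, not the paper's value)
T = 16   # sequence length

def feed_forward(x, w1, w2):
    # position-wise FFN: Linear -> ReLU -> Linear
    return np.maximum(x @ w1, 0.0) @ w2

def self_attention(x, wq, wk, wv):
    # single-head scaled dot-product self-attention (global dependencies)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def depthwise_conv(x, kernel):
    # depthwise 1-D convolution over time (local dependencies), 'same' padding
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)
    return out

def encoder_block(x, p):
    # Half-step FFN, then conv and attention in PARALLEL (summed),
    # then another conv module in SERIES, then the closing half-step FFN.
    x = x + 0.5 * feed_forward(x, *p["ffn1"])
    x = x + self_attention(x, *p["attn"]) + depthwise_conv(x, p["conv1"])
    x = x + depthwise_conv(x, p["conv2"])
    x = x + 0.5 * feed_forward(x, *p["ffn2"])
    return x

params = {
    "ffn1": (rng.standard_normal((d, 4 * d)) * 0.1,
             rng.standard_normal((4 * d, d)) * 0.1),
    "ffn2": (rng.standard_normal((d, 4 * d)) * 0.1,
             rng.standard_normal((4 * d, d)) * 0.1),
    "attn": tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3)),
    "conv1": rng.standard_normal((3, d)) * 0.1,  # kernel size 3 (assumed)
    "conv2": rng.standard_normal((3, d)) * 0.1,
}

x = rng.standard_normal((T, d))
y = encoder_block(x, params)
print(y.shape)  # (16, 8): the block preserves the sequence shape
```

The parallel convolution/attention branch lets each position mix local neighborhood features with global context in the same sub-layer, which is the local-dependency enhancement the abstract claims over a plain transformer encoder.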
Pages: 4334-4338
Page count: 5
Related papers
50 records in total
  • [1] wav2vec: Unsupervised Pre-training for Speech Recognition
    Schneider, Steffen
    Baevski, Alexei
    Collobert, Ronan
    Auli, Michael
    INTERSPEECH 2019, 2019, : 3465 - 3469
  • [2] Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
    Stefanel Gris, Lucas Rafael
    Casanova, Edresson
    de Oliveira, Frederico Santos
    Soares, Anderson da Silva
    Candido Junior, Arnaldo
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022, 2022, 13208 : 333 - 343
  • [3] Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
    Pepino, Leonardo
    Riera, Pablo
    Ferrer, Luciana
    INTERSPEECH 2021, 2021, : 3400 - 3404
  • [4] Speech recognition model design for Sundanese language using WAV2VEC 2.0
    Cryssiover, A.
    Zahra, A.
    International Journal of Speech Technology, 2024, 27 (01) : 171 - 177
  • [5] Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
    Hsu, Wei-Ning
    Sriram, Anuroop
    Baevski, Alexei
    Likhomanenko, Tatiana
    Xu, Qiantong
    Pratap, Vineel
    Kahn, Jacob
    Lee, Ann
    Collobert, Ronan
    Synnaeve, Gabriel
    Auli, Michael
    INTERSPEECH 2021, 2021, : 721 - 725
  • [6] WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition
    Li, Feng
    Luo, Jiusong
    Xia, Wanjun
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 325 - 336
  • [7] Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0
    Kunesova, Marie
    Rezackova, Marketa
    TEXT, SPEECH, AND DIALOGUE (TSD 2022), 2022, 13502 : 377 - 388
  • [8] Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition
    Sun, Chenjing
    Zhou, Yi
    Huang, Xin
    Yang, Jichen
    Hou, Xianhua
    ELECTRONICS, 2024, 13 (06)
  • [9] Speech Emotion Recognition Based on Shallow Structure of Wav2vec 2.0 and Attention Mechanism
    Zhang, Yumei
    Jia, Maoshen
    Cao, Xuan
    Zhao, Zichen
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 398 - 402
  • [10] MULTI-LINGUAL MULTI-TASK SPEECH EMOTION RECOGNITION USING WAV2VEC 2.0
    Sharma, Mayank
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6907 - 6911