An improved wav2vec 2.0 pre-training approach using enhanced local dependency modeling for speech recognition

Cited by: 2
Authors
Zhu, Qiu-shi [1]
Zhang, Jie [1]
Wu, Ming-hui [2]
Fang, Xin [1,2]
Dai, Li-Rong [1]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] iFlytek Co Ltd, iFlytek Res, Hefei, Peoples R China
Source
INTERSPEECH 2021 | 2021
Funding
National Key R&D Program of China;
Keywords
Speech recognition; pre-training; wav2vec 2.0; transformer; low-resource; local and global dependence; TRANSFORMER;
DOI
10.21437/Interspeech.2021-67
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104; 100213;
Abstract
Wav2vec 2.0 is a recently proposed self-supervised pre-training framework for learning speech representations. It uses a transformer to learn global contextual representations, which is especially effective in low-resource scenarios. It has also been shown that combining convolutional neural networks and transformers to model both local and global dependencies is beneficial for tasks such as automatic speech recognition (ASR) and natural language processing (NLP). However, how to model local and global dependencies in pre-training models remains an open question in the speech domain. In this paper, we therefore propose a new transformer encoder that enhances local dependency modeling by combining convolution and self-attention modules. The encoder first places a convolution module and a self-attention module in parallel, then serializes them with another convolution module, and sandwiches the whole with a pair of feed-forward modules. Experimental results show that the model pre-trained with the proposed method reduces the word error rate (WER) compared to our reproduced wav2vec 2.0, at the cost of a slight increase in the number of model parameters.
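The block structure the abstract describes (a pair of feed-forward modules sandwiching parallel convolution and self-attention branches followed by a serial convolution module) can be sketched in plain NumPy. This is a minimal illustrative sketch, not the authors' implementation: it assumes a single attention head without learned projections, simple per-channel 1-D convolutions, and Conformer-style half-step feed-forward residuals; all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # normalize each time step over the feature dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, w1, w2):
    # position-wise feed-forward module with ReLU
    return np.maximum(x @ w1, 0) @ w2

def self_attention(x):
    # single-head scaled dot-product self-attention over time
    # (learned Q/K/V projections omitted for brevity)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(-1, keepdims=True)
    return attn @ x

def depthwise_conv(x, kernel):
    # per-channel 1-D convolution with 'same' padding: local context only
    return np.stack([np.convolve(x[:, c], kernel, mode="same")
                     for c in range(x.shape[1])], axis=1)

def encoder_block(x, w1, w2, w3, w4, k1, k2):
    # 1) first half-step feed-forward module, with residual
    x = x + 0.5 * feed_forward(layer_norm(x), w1, w2)
    # 2) parallel branches: self-attention (global) + convolution (local), summed
    y = layer_norm(x)
    x = x + self_attention(y) + depthwise_conv(y, k1)
    # 3) serial convolution module
    x = x + depthwise_conv(layer_norm(x), k2)
    # 4) second half-step feed-forward module, then final norm
    x = x + 0.5 * feed_forward(layer_norm(x), w3, w4)
    return layer_norm(x)

T, D, H = 8, 16, 32  # time steps, model dim, hidden dim
x = rng.standard_normal((T, D))
w1, w3 = rng.standard_normal((2, D, H)) * 0.1
w2, w4 = rng.standard_normal((2, H, D)) * 0.1
k1, k2 = rng.standard_normal((2, 3)) * 0.1  # conv kernels of width 3
out = encoder_block(x, w1, w2, w3, w4, k1, k2)
print(out.shape)  # (8, 16): shape is preserved, so blocks can be stacked
```

Because every module is residual and shape-preserving, such blocks can be stacked to form the full encoder, with the convolution branches supplying the enhanced local dependency modeling the paper targets.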
Pages: 4334-4338
Page count: 5
Related papers
50 records in total
  • [31] Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction
    Becerra, Helard
    Ragano, Alessandro
    Hines, Andrew
    INTERSPEECH 2022, 2022, : 4088 - 4092
  • [32] Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation
    Fukuda, Ryo
    Sudoh, Katsuhito
    Nakamura, Satoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 906 - 916
  • [33] Improving Tone Recognition Performance using Wav2vec 2.0-Based Learned Representation in Yoruba, a Low-Resourced Language
    Obiang, Saint Germes B. Bengono
    Tsopze, Norbert
    Yonta, Paulin Melatagia
    Bonastre, Jean-Francois
    Jimenez, Tania
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (12)
  • [34] WavBERT: Exploiting Semantic and Non-semantic Speech using Wav2vec and BERT for Dementia Detection
    Zhu, Youxiang
    Obyat, Abdelrahman
    Liang, Xiaohui
    Batsis, John A.
    Roth, Robert M.
    INTERSPEECH 2021, 2021, : 3790 - 3794
  • [35] Transfer Ability of Monolingual Wav2vec2.0 for Low-resource Speech Recognition
    Yi, Cheng
    Wang, Jianzong
    Cheng, Ning
    Zhou, Shiyu
    Xu, Bo
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [36] SENTIMENT-AWARE AUTOMATIC SPEECH RECOGNITION PRE-TRAINING FOR ENHANCED SPEECH EMOTION RECOGNITION
    Ghriss, Ayoub
    Yang, Bo
    Rozgic, Viktor
    Shriberg, Elizabeth
    Wang, Chao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7347 - 7351
  • [37] Speech emotion recognition using fine-tuned Wav2vec2.0 and neural controlled differential equations classifier
    Wang, Ni
    Yang, Danyu
    PLOS ONE, 2025, 20 (02):
  • [38] Wav2vec-MoE: An unsupervised pre-training and adaptation method for multi-accent ASR
    Lin, Yuqin
    Zhang, Shiliang
    Gao, Zhifu
    Wang, Longbiao
    Yang, Yanbing
    Dang, Jianwu
    ELECTRONICS LETTERS, 2023, 59 (11)
  • [39] Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
    Zhu, Han
    Wang, Li
    Wang, Jindong
    Cheng, Gaofeng
    Zhang, Pengyuan
    Yan, Yonghong
    INTERSPEECH 2022, 2022, : 4870 - 4874
  • [40] K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of Graphemes and Syllables
    Kim, Jounghee
    Kang, Pilsung
    INTERSPEECH 2022, 2022, : 4945 - 4949