An improved wav2vec 2.0 pre-training approach using enhanced local dependency modeling for speech recognition

Cited by: 2
Authors
Zhu, Qiu-shi [1]
Zhang, Jie [1]
Wu, Ming-hui [2]
Fang, Xin [1,2]
Dai, Li-Rong [1]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] iFlytek Co Ltd, iFlytek Res, Hefei, Peoples R China
Source
INTERSPEECH 2021 | 2021
Funding
National Key R&D Program of China;
Keywords
Speech recognition; pre-training; wav2vec 2.0; transformer; low-resource; local and global dependence; TRANSFORMER;
DOI
10.21437/Interspeech.2021-67
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104; 100213;
Abstract
Wav2vec 2.0 is a recently proposed self-supervised pre-training framework for learning speech representations. It uses a transformer to learn global contextual representations, which is especially effective in low-resource scenarios. It has also been shown that combining convolutional neural networks and transformers to model both local and global dependencies is beneficial for tasks such as automatic speech recognition (ASR) and natural language processing (NLP). However, how to model local and global dependencies in pre-training models remains an open question in the speech domain. In this paper, we therefore propose a new transformer encoder that enhances local dependency modeling by combining convolution and self-attention modules. The encoder first places a convolution module and a self-attention module in parallel, then serializes them with another convolution module, and sandwiches the whole with a pair of feed-forward modules. Experimental results show that the model pre-trained with the proposed method reduces the word error rate (WER) compared to our reproduced wav2vec 2.0, at the cost of a slight increase in the number of model parameters.
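The block structure the abstract describes (a pair of feed-forward modules sandwiching parallel convolution and self-attention branches followed by a serial convolution module) can be sketched in plain NumPy. This is a minimal illustrative sketch, not the authors' implementation: it assumes a single attention head without learned projections, simple per-channel 1-D convolutions, and Conformer-style half-step feed-forward residuals; all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # normalize each time step over the feature dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, w1, w2):
    # position-wise feed-forward module with ReLU
    return np.maximum(x @ w1, 0) @ w2

def self_attention(x):
    # single-head scaled dot-product self-attention over time
    # (learned Q/K/V projections omitted for brevity)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(-1, keepdims=True)
    return attn @ x

def depthwise_conv(x, kernel):
    # per-channel 1-D convolution with 'same' padding: local context only
    return np.stack([np.convolve(x[:, c], kernel, mode="same")
                     for c in range(x.shape[1])], axis=1)

def encoder_block(x, w1, w2, w3, w4, k1, k2):
    # 1) first half-step feed-forward module, with residual
    x = x + 0.5 * feed_forward(layer_norm(x), w1, w2)
    # 2) parallel branches: self-attention (global) + convolution (local), summed
    y = layer_norm(x)
    x = x + self_attention(y) + depthwise_conv(y, k1)
    # 3) serial convolution module
    x = x + depthwise_conv(layer_norm(x), k2)
    # 4) second half-step feed-forward module, then final norm
    x = x + 0.5 * feed_forward(layer_norm(x), w3, w4)
    return layer_norm(x)

T, D, H = 8, 16, 32  # time steps, model dim, hidden dim
x = rng.standard_normal((T, D))
w1, w3 = rng.standard_normal((2, D, H)) * 0.1
w2, w4 = rng.standard_normal((2, H, D)) * 0.1
k1, k2 = rng.standard_normal((2, 3)) * 0.1  # conv kernels of width 3
out = encoder_block(x, w1, w2, w3, w4, k1, k2)
print(out.shape)  # (8, 16): shape is preserved, so blocks can be stacked
```

Because every module is residual and shape-preserving, such blocks can be stacked to form the full encoder, with the convolution branches supplying the enhanced local dependency modeling the paper targets.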
Pages: 4334-4338
Page count: 5
Related papers
50 records in total
  • [31] Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction
    Becerra, Helard
    Ragano, Alessandro
    Hines, Andrew
    INTERSPEECH 2022, 2022, : 4088 - 4092
  • [32] Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation
    Fukuda, Ryo
    Sudoh, Katsuhito
    Nakamura, Satoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 906 - 916
  • [33] Improving Tone Recognition Performance using Wav2vec 2.0-Based Learned Representation in Yoruba, a Low-Resourced Language
    Obiang, Saint Germes B. Bengono
    Tsopze, Norbert
    Yonta, Paulin Melatagia
    Bonastre, Jean-Francois
    Jimenez, Tania
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (12)
  • [34] WavBERT: Exploiting Semantic and Non-semantic Speech using Wav2vec and BERT for Dementia Detection
    Zhu, Youxiang
    Obyat, Abdelrahman
    Liang, Xiaohui
    Batsis, John A.
    Roth, Robert M.
    INTERSPEECH 2021, 2021, : 3790 - 3794
  • [35] Transfer Ability of Monolingual Wav2vec2.0 for Low-resource Speech Recognition
    Yi, Cheng
    Wang, Jianzong
    Cheng, Ning
    Zhou, Shiyu
    Xu, Bo
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [36] SENTIMENT-AWARE AUTOMATIC SPEECH RECOGNITION PRE-TRAINING FOR ENHANCED SPEECH EMOTION RECOGNITION
    Ghriss, Ayoub
    Yang, Bo
    Rozgic, Viktor
    Shriberg, Elizabeth
    Wang, Chao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7347 - 7351
  • [37] Speech emotion recognition using fine-tuned Wav2vec2.0 and neural controlled differential equations classifier
    Wang, Ni
    Yang, Danyu
    PLOS ONE, 2025, 20 (02):
  • [38] Wav2vec-MoE: An unsupervised pre-training and adaptation method for multi-accent ASR
    Lin, Yuqin
    Zhang, Shiliang
    Gao, Zhifu
    Wang, Longbiao
    Yang, Yanbing
    Dang, Jianwu
    ELECTRONICS LETTERS, 2023, 59 (11)
  • [39] Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
    Zhu, Han
    Wang, Li
    Wang, Jindong
    Cheng, Gaofeng
    Zhang, Pengyuan
    Yan, Yonghong
    INTERSPEECH 2022, 2022, : 4870 - 4874
  • [40] K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of Graphemes and Syllables
    Kim, Jounghee
    Kang, Pilsung
    INTERSPEECH 2022, 2022, : 4945 - 4949