An improved wav2vec 2.0 pre-training approach using enhanced local dependency modeling for speech recognition

Cited by: 2
Authors
Zhu, Qiu-shi [1 ]
Zhang, Jie [1 ]
Wu, Ming-hui [2 ]
Fang, Xin [1 ,2 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] iFlytek Co Ltd, iFlytek Res, Hefei, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Speech recognition; pre-training; wav2vec 2.0; transformer; low-resource; local and global dependence; TRANSFORMER;
DOI
10.21437/Interspeech.2021-67
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification
100104 ; 100213 ;
Abstract
Wav2vec 2.0 is a recently proposed self-supervised pre-training framework for learning speech representations. It uses a transformer to learn global contextual representations, which is especially effective in low-resource scenarios. It has also been shown that combining convolutional neural networks and transformers to model both local and global dependencies is beneficial for tasks such as automatic speech recognition (ASR) and natural language processing (NLP). However, how to model local and global dependencies in pre-training models remains an open question in the speech domain. In this paper, we therefore propose a new transformer encoder that enhances local dependency modeling by combining convolution and self-attention modules. The encoder first places the convolution and self-attention modules in parallel, then follows them with another convolution module in series, with the whole structure sandwiched between a pair of feed-forward modules. Experimental results show that a model pre-trained with the proposed method reduces the word error rate (WER) compared to our reproduced wav2vec 2.0, at the cost of a slight increase in the number of model parameters.
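The block structure described in the abstract (a half of a feed-forward module, then convolution and self-attention in parallel, then a serial convolution module, then the closing feed-forward half) can be sketched roughly as below. This is a minimal illustrative NumPy sketch, not the paper's implementation: all dimensions, weight initializations, the single attention head, and the absence of layer normalization and gating are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8    # model dimension (illustrative, not the paper's value)
T = 16   # sequence length

def feed_forward(x, w1, w2):
    # position-wise FFN: Linear -> ReLU -> Linear
    return np.maximum(x @ w1, 0.0) @ w2

def self_attention(x, wq, wk, wv):
    # single-head scaled dot-product self-attention (global dependencies)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def depthwise_conv(x, kernel):
    # depthwise 1-D convolution over time (local dependencies), 'same' padding
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)
    return out

def encoder_block(x, p):
    # Half-step FFN, then conv and attention in PARALLEL (summed),
    # then another conv module in SERIES, then the closing half-step FFN.
    x = x + 0.5 * feed_forward(x, *p["ffn1"])
    x = x + self_attention(x, *p["attn"]) + depthwise_conv(x, p["conv1"])
    x = x + depthwise_conv(x, p["conv2"])
    x = x + 0.5 * feed_forward(x, *p["ffn2"])
    return x

params = {
    "ffn1": (rng.standard_normal((d, 4 * d)) * 0.1,
             rng.standard_normal((4 * d, d)) * 0.1),
    "ffn2": (rng.standard_normal((d, 4 * d)) * 0.1,
             rng.standard_normal((4 * d, d)) * 0.1),
    "attn": tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3)),
    "conv1": rng.standard_normal((3, d)) * 0.1,  # kernel size 3 (assumed)
    "conv2": rng.standard_normal((3, d)) * 0.1,
}

x = rng.standard_normal((T, d))
y = encoder_block(x, params)
print(y.shape)  # (16, 8): the block preserves the sequence shape
```

The parallel convolution/attention branch lets each position mix local neighborhood features with global context in the same sub-layer, which is the local-dependency enhancement the abstract claims over a plain transformer encoder.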
Pages: 4334-4338
Page count: 5
Related papers
50 records in total
  • [1] wav2vec: Unsupervised Pre-training for Speech Recognition
    Schneider, Steffen
    Baevski, Alexei
    Collobert, Ronan
    Auli, Michael
    INTERSPEECH 2019, 2019, : 3465 - 3469
  • [2] Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
    Stefanel Gris, Lucas Rafael
    Casanova, Edresson
    de Oliveira, Frederico Santos
    Soares, Anderson da Silva
    Candido Junior, Arnaldo
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022, 2022, 13208 : 333 - 343
  • [3] Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
    Pepino, Leonardo
    Riera, Pablo
    Ferrer, Luciana
    INTERSPEECH 2021, 2021, : 3400 - 3404
  • [4] Speech recognition model design for Sundanese language using WAV2VEC 2.0
    Cryssiover, A.
    Zahra, A.
    International Journal of Speech Technology, 2024, 27 (01) : 171 - 177
  • [5] Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
    Hsu, Wei-Ning
    Sriram, Anuroop
    Baevski, Alexei
    Likhomanenko, Tatiana
    Xu, Qiantong
    Pratap, Vineel
    Kahn, Jacob
    Lee, Ann
    Collobert, Ronan
    Synnaeve, Gabriel
    Auli, Michael
    INTERSPEECH 2021, 2021, : 721 - 725
  • [6] WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition
    Li, Feng
    Luo, Jiusong
    Xia, Wanjun
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 325 - 336
  • [7] Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0
    Kunesova, Marie
    Rezackova, Marketa
    TEXT, SPEECH, AND DIALOGUE (TSD 2022), 2022, 13502 : 377 - 388
  • [8] Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition
    Sun, Chenjing
    Zhou, Yi
    Huang, Xin
    Yang, Jichen
    Hou, Xianhua
    ELECTRONICS, 2024, 13 (06)
  • [9] Speech Emotion Recognition Based on Shallow Structure of Wav2vec 2.0 and Attention Mechanism
    Zhang, Yumei
    Jia, Maoshen
    Cao, Xuan
    Zhao, Zichen
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 398 - 402
  • [10] MULTI-LINGUAL MULTI-TASK SPEECH EMOTION RECOGNITION USING WAV2VEC 2.0
    Sharma, Mayank
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6907 - 6911