An improved wav2vec 2.0 pre-training approach using enhanced local dependency modeling for speech recognition

Cited by: 2

Authors:

Zhu, Qiu-shi [1]

Zhang, Jie [1]

Wu, Ming-hui [2]

Fang, Xin [1,2]

Dai, Li-Rong [1]

Affiliations:

[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China

[2] iFlytek Co Ltd, iFlytek Res, Hefei, Peoples R China

Source:

INTERSPEECH 2021 | 2021

Funding:

National Key Research and Development Program of China;

Keywords:

Speech recognition; pre-training; wav2vec 2.0; transformer; low-resource; local and global dependence

DOI:

10.21437/Interspeech.2021-67

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Subject classification code:

100104 ; 100213 ;

Abstract:

Wav2vec 2.0 is a recently proposed self-supervised pre-training framework for learning speech representations. It uses a transformer to learn global contextual representations, which is especially effective in low-resource scenarios. Moreover, combining convolutional neural networks and transformers to model both local and global dependencies has been shown to be beneficial in areas such as automatic speech recognition (ASR) and natural language processing (NLP). However, how to model local and global dependencies in pre-training models is still an open question in the speech domain. In this paper, we therefore propose a new transformer encoder that enhances local dependency modeling by combining convolution and self-attention modules. The encoder first places the convolution and self-attention modules in parallel, then serializes them with another convolution module, and sandwiches the result between a pair of feed-forward modules. Experimental results show that the model pre-trained with the proposed method reduces the word error rate (WER) compared to the reproduced wav2vec 2.0, at the cost of a slight increase in the number of training parameters.
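The block layout described in the abstract (parallel convolution and self-attention, followed by a second convolution, all between two feed-forward modules) can be illustrated with a small sketch. The following is a minimal pure-Python toy, not the authors' implementation: the abstract does not say how the parallel branches are merged, where residual connections or layer normalization sit, or what kind of convolution is used. Summing the two branches, the half-step feed-forward residuals, the pre-normalization, the single attention head, the depthwise convolution, and every function and parameter name below are assumptions made purely for illustration.

```python
import math
import random

def matmul(x, w):
    """(T x D) times (D x E) -> (T x E), with matrices as lists of rows."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def add(x, y, scale=1.0):
    """Residual connection: x + scale * y, element-wise."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def layer_norm(x, eps=1e-5):
    """Normalize each frame to zero mean and unit variance (no learned gain)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU in between."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention (global context)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * vj[c] for e, vj in zip(exps, v)) for c in range(d)])
    return out

def depthwise_conv(x, kernels):
    """Per-channel 1-D convolution over time with zero padding (local context).
    kernels[c] is an odd-length list of taps for channel c."""
    T, D = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * D for _ in range(T)]
    for t in range(T):
        for c in range(D):
            for j, tap in enumerate(kernels[c]):
                s = t + j - half
                if 0 <= s < T:
                    out[t][c] += tap * x[s][c]
    return out

def init_params(dim, hidden, kernel_size, seed=0):
    """Random toy weights; names are hypothetical."""
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.gauss(0, 0.1) for _ in range(c)] for _ in range(r)]
    return {
        "ffn1": (mat(dim, hidden), mat(hidden, dim)),
        "ffn2": (mat(dim, hidden), mat(hidden, dim)),
        "attn": (mat(dim, dim), mat(dim, dim), mat(dim, dim)),
        "conv_parallel": mat(dim, kernel_size),
        "conv_serial": mat(dim, kernel_size),
    }

def encoder_block(x, p):
    # First feed-forward module (half-step residual is an assumption).
    x = add(x, feed_forward(layer_norm(x), *p["ffn1"]), scale=0.5)
    # Parallel branches: self-attention and convolution see the same input;
    # summing their outputs is an assumption.
    h = layer_norm(x)
    parallel = add(self_attention(h, *p["attn"]), depthwise_conv(h, p["conv_parallel"]))
    x = add(x, parallel)
    # Second convolution module, applied in series after the parallel stage.
    x = add(x, depthwise_conv(layer_norm(x), p["conv_serial"]))
    # Second feed-forward module closes the sandwich.
    x = add(x, feed_forward(layer_norm(x), *p["ffn2"]), scale=0.5)
    return layer_norm(x)

params = init_params(dim=4, hidden=8, kernel_size=3)
frames = [[0.1 * (t + c) for c in range(4)] for t in range(6)]  # 6 frames x 4 channels
output = encoder_block(frames, params)
print(len(output), len(output[0]))  # same shape as the input: 6 4
```

In this toy, the attention branch mixes information across all frames while the convolution branches only look at a three-frame neighborhood, which is the local/global split the abstract refers to. The extra convolution weights also show why such a block would have somewhat more parameters than a plain transformer layer.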


Pages: 4334-4338

Page count: 5

