Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Cited by: 0
Authors
Zhu, Jiaxu [1 ,3 ,6 ]
Tong, Weinan [1 ]
Xu, Yaoxun [1 ]
Song, Changhe [1 ,2 ]
Wu, Zhiyong [1 ,2 ]
You, Zhao [3 ]
Su, Dan [3 ]
Yu, Dong [4 ]
Meng, Helen [5 ]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent AI Lab, Bellevue, WA USA
[5] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[6] Tencent Inc, Shenzhen, Peoples R China
Source
INTERSPEECH 2023
Keywords
Speech Recognition; Text-Only; Continuous Integrate-and-Fire; Domain Adaptation
DOI
10.21437/Interspeech.2023-1378
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Mapping the two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the lengths of the speech and text representations are inconsistent. Although previous methods up-sample the text representation to align it with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, enabling domain adaptation with text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
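For readers unfamiliar with the CIF mechanism mentioned in the abstract, the following is a minimal sketch (not the authors' released code) of how a continuous integrate-and-fire module down-samples frame-level acoustic representations to token length: per-frame scalar weights are accumulated, and a token-level vector is emitted each time the accumulation crosses a threshold of 1.0, as in the original CIF formulation. The function name cif_downsample and the toy shapes are illustrative assumptions, not taken from the paper.

```python
import torch

def cif_downsample(frames: torch.Tensor, alphas: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Integrate frame-level acoustic vectors into token-level vectors.

    frames: (T, D) encoder outputs for one utterance.
    alphas: (T,)   per-frame weights in [0, 1], e.g. sigmoid outputs of a small head.
    Returns (U, D), where U is the number of fired token-level representations.
    """
    tokens = []
    accumulated = 0.0                                  # weight integrated since the last firing
    integrated = frames.new_zeros(frames.size(1))      # running weighted sum of frames
    for h_t, a_t in zip(frames, alphas):
        a_t = a_t.item()
        if accumulated + a_t < threshold:              # not enough weight yet: keep integrating
            accumulated += a_t
            integrated = integrated + a_t * h_t
        else:                                          # threshold crossed: fire a token boundary
            used = threshold - accumulated             # portion of this frame that completes the token
            tokens.append(integrated + used * h_t)
            remainder = a_t - used                     # leftover weight starts the next token
            accumulated = remainder
            integrated = remainder * h_t
    if not tokens:
        return frames.new_zeros(0, frames.size(1))
    return torch.stack(tokens)

# Toy usage: 200 acoustic frames are compressed to roughly sum(alphas) token-level
# vectors, which can then be matched against token-level text representations.
encoder_out = torch.randn(200, 256)
alphas = torch.sigmoid(torch.randn(200))
token_level = cif_downsample(encoder_out, alphas)
print(token_level.shape)   # -> (U, 256) with U approximately alphas.sum()
```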
Pages: 1334-1338
Number of pages: 5