Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Cited by: 0

Authors:

Zhu, Jiaxu [1,3,6]

Tong, Weinan [1]

Xu, Yaoxun [1]

Song, Changhe [1,2]

Wu, Zhiyong [1,2]

You, Zhao [3]

Su, Dan [3]

Yu, Dong [4]

Meng, Helen [5]

Affiliations:

[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China

[2] Peng Cheng Lab, Shenzhen, Peoples R China

[3] Tencent AI Lab, Shenzhen, Peoples R China

[4] Tencent AI Lab, Bellevue, WA USA

[5] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[6] Tencent Inc, Shenzhen, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Keywords:

Speech Recognition; Text-Only; Continuous Integrate and Fire; Domain Adaption;

DOI:

10.21437/Interspeech.2023-1378

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Mapping the two modalities, speech and text, into a shared representation space is one approach to using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech representations and text representations differ in length. Although a previous method up-samples the text representation to align it with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align it with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
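The down-sampling idea behind a continuous integrate-and-fire (CIF) module can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cif_downsample`, the plain-list representation, and the fixed threshold of 1.0 are assumptions made for illustration, and in a trained model the per-frame weights would be predicted by a learned layer rather than supplied by hand. The sketch also assumes every weight is smaller than the threshold, so a frame fires at most once.

```python
from typing import List


def cif_downsample(
    frames: List[List[float]],
    alphas: List[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Integrate weighted frames and emit one vector per threshold crossing.

    Hypothetical helper for illustration only. Assumes each alpha is
    smaller than `threshold`, so a single frame can fire at most once.
    """
    dim = len(frames[0]) if frames else 0
    outputs: List[List[float]] = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_vec = [0.0] * dim     # weighted sum of frames since the last firing

    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Keep integrating: the boundary has not been reached yet.
            acc_weight += alpha
            acc_vec = [a + alpha * x for a, x in zip(acc_vec, frame)]
        else:
            # Fire: use just enough of this frame's weight to reach the
            # threshold, then carry the remainder into the next token.
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]

    # Any leftover weight below the threshold is dropped in this sketch.
    return outputs


# Four acoustic frames are reduced to two token-level vectors.
tokens = cif_downsample([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75])
print(len(tokens))
```

Because the number of emitted vectors tracks the accumulated weight rather than the number of frames, the output length can be made to follow the token count, which is the property the abstract relies on for matching the text modality.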


Pages: 1334-1338

Page count: 5
