Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Cited by: 0
Authors
Zhu, Jiaxu [1 ,3 ,6 ]
Tong, Weinan [1 ]
Xu, Yaoxun [1 ]
Song, Changhe [1 ,2 ]
Wu, Zhiyong [1 ,2 ]
You, Zhao [3 ]
Su, Dan [3 ]
Yu, Dong [4 ]
Meng, Helen [5 ]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent AI Lab, Bellevue, WA USA
[5] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[6] Tencent Inc, Shenzhen, Peoples R China
Source
Keywords
Speech Recognition; Text-Only; Continuous Integrate and Fire; Domain Adaption;
DOI
10.21437/Interspeech.2023-1378
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
Mapping the two modalities, speech and text, into a shared representation space is one approach to using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech representations and text representations differ in length. Although previous methods up-sample the text representation to align it with the acoustic modality, the up-sampled length may not match the actual speech duration. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align it with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
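The down-sampling idea in the abstract can be illustrated with a minimal sketch of the general continuous integrate-and-fire mechanism. This is not the authors' implementation: the function name `cif_downsample`, the threshold of 1.0, and the hand-written weights are assumptions chosen for illustration. In a real model the per-frame weights would be predicted by the network rather than supplied by hand. The sketch accumulates weighted frames and emits one token-level vector each time the accumulated weight reaches the threshold, so the output length follows the token count rather than the frame count.

```python
import numpy as np

def cif_downsample(frames, alphas, threshold=1.0):
    """Integrate frame vectors until the accumulated weight reaches
    `threshold`, then fire one token-level vector.

    frames: (T, D) frame-level representations.
    alphas: (T,) non-negative weights, each assumed <= threshold.
    Returns an (N, D) array with one row per fired token.
    """
    frames = np.asarray(frames, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    acc_weight = 0.0                        # weight integrated so far
    acc_state = np.zeros(frames.shape[1])   # weighted sum integrated so far
    fired = []
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state += a * h
        else:
            # Use only the part of this frame needed to reach the threshold,
            # fire, and carry the remainder over to the next token.
            used = threshold - acc_weight
            fired.append(acc_state + used * h)
            acc_weight = a - used
            acc_state = acc_weight * h
    if not fired:
        return np.zeros((0, frames.shape[1]))
    return np.stack(fired)

# Six frames whose weights sum to 2.0 are reduced to two token-level vectors.
frames = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 1]], dtype=float)
alphas = np.array([0.5, 0.5, 0.25, 0.25, 0.25, 0.25])
tokens = cif_downsample(frames, alphas)
print(tokens.shape)  # (2, 2)
```

Any weight left over after the last fired token is discarded in this sketch; how a full system handles that tail is not specified here.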
Pages: 1334-1338
Number of pages: 5