Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Cited by: 0
Authors
Zhu, Jiaxu [1 ,3 ,6 ]
Tong, Weinan [1 ]
Xu, Yaoxun [1 ]
Song, Changhe [1 ,2 ]
Wu, Zhiyong [1 ,2 ]
You, Zhao [3 ]
Su, Dan [3 ]
Yu, Dong [4 ]
Meng, Helen [5 ]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent AI Lab, Bellevue, WA USA
[5] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[6] Tencent Inc, Shenzhen, Peoples R China
Source
INTERSPEECH 2023
Keywords
Speech Recognition; Text-Only; Continuous Integrate-and-Fire; Domain Adaptation
DOI
10.21437/Interspeech.2023-1378
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Mapping the two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the lengths of the speech and text representations are inconsistent. Although previous methods up-sample the text representation to align it with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, enabling domain adaptation with text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
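For readers unfamiliar with the CIF mechanism mentioned in the abstract, the following is a minimal sketch (not the authors' released code) of how a continuous integrate-and-fire module down-samples frame-level acoustic representations to token length: per-frame scalar weights are accumulated, and a token-level vector is emitted each time the accumulation crosses a threshold of 1.0, as in the original CIF formulation. The function name cif_downsample and the toy shapes are illustrative assumptions, not taken from the paper.

```python
import torch

def cif_downsample(frames: torch.Tensor, alphas: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Integrate frame-level acoustic vectors into token-level vectors.

    frames: (T, D) encoder outputs for one utterance.
    alphas: (T,)   per-frame weights in [0, 1], e.g. sigmoid outputs of a small head.
    Returns (U, D), where U is the number of fired token-level representations.
    """
    tokens = []
    accumulated = 0.0                                  # weight integrated since the last firing
    integrated = frames.new_zeros(frames.size(1))      # running weighted sum of frames
    for h_t, a_t in zip(frames, alphas):
        a_t = a_t.item()
        if accumulated + a_t < threshold:              # not enough weight yet: keep integrating
            accumulated += a_t
            integrated = integrated + a_t * h_t
        else:                                          # threshold crossed: fire a token boundary
            used = threshold - accumulated             # portion of this frame that completes the token
            tokens.append(integrated + used * h_t)
            remainder = a_t - used                     # leftover weight starts the next token
            accumulated = remainder
            integrated = remainder * h_t
    if not tokens:
        return frames.new_zeros(0, frames.size(1))
    return torch.stack(tokens)

# Toy usage: 200 acoustic frames are compressed to roughly sum(alphas) token-level
# vectors, which can then be matched against token-level text representations.
encoder_out = torch.randn(200, 256)
alphas = torch.sigmoid(torch.randn(200))
token_level = cif_downsample(encoder_out, alphas)
print(token_level.shape)   # -> (U, 256) with U approximately alphas.sum()
```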
Pages: 1334-1338
Number of pages: 5