Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition

Cited by: 0
Authors
Deng, Keqi [1]
Woodland, Philip C. [1]
Affiliations
[1] Univ Cambridge, Dept Engn, Cambridge CB2 1TN, England
Funding
UK Engineering and Physical Sciences Research Council
Keywords
Domain adaptation; E2E ASR; neural transducer; LANGUAGE MODEL; RNN-TRANSDUCER; ARCHITECTURE; TRANSFORMER
DOI
10.1109/TASLP.2024.3419421
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Although end-to-end (E2E) automatic speech recognition (ASR) has shown state-of-the-art recognition accuracy, it tends to be implicitly biased towards the training data distribution, which can degrade generalisation. This paper proposes a label-synchronous neural transducer (LS-Transducer), which provides a natural approach to domain adaptation based on text-only data. The LS-Transducer extracts a label-level encoder representation before combining it with the prediction network output. Since blank tokens are no longer needed, the prediction network acts as a standard language model, which can be easily adapted using text-only data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining the low-latency operation needed for streaming. In addition, a streaming joint decoding method is designed to improve ASR accuracy while retaining synchronisation with AIF. Experiments show that, compared to standard neural transducers, the proposed LS-Transducer gave a 12.9% relative WER reduction (WERR) on intra-domain LibriSpeech data, as well as 21.4% and 24.6% WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
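The WERR figures quoted in the abstract are relative reductions, conventionally computed as (baseline WER - new WER) / baseline WER. A minimal sketch of that arithmetic follows; the function name and the WER values in the usage line are hypothetical illustrations and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example values, not results from the paper:
# a baseline WER of 10.0% improved to 8.0%.
werr = relative_wer_reduction(10.0, 8.0)
print(f"WERR: {werr:.1f}%")
```

Note that a relative reduction is distinct from the absolute difference in WER: in the hypothetical example above the absolute drop is 2.0 points, while the relative reduction is measured against the baseline.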
Pages: 3507-3516
Page count: 10