Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition

Cited by: 0

Authors:

Deng, Keqi [1]

Woodland, Philip C. [1]

Affiliations:

[1] Univ Cambridge, Dept Engn, Cambridge CB2 1TN, England

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

Domain adaptation; E2E ASR; neural transducer; LANGUAGE MODEL; RNN-TRANSDUCER; ARCHITECTURE; TRANSFORMER;

DOI:

10.1109/TASLP.2024.3419421

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Although end-to-end (E2E) automatic speech recognition (ASR) has shown state-of-the-art recognition accuracy, it tends to be implicitly biased towards the training data distribution, which can degrade generalisation. This paper proposes a label-synchronous neural transducer (LS-Transducer), which provides a natural approach to domain adaptation based on text-only data. The LS-Transducer extracts a label-level encoder representation before combining it with the prediction network output. Since blank tokens are no longer needed, the prediction network acts as a standard language model, which can be easily adapted using text-only data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining low-latency operation suitable for streaming. In addition, a streaming joint decoding method is designed to improve ASR accuracy while retaining synchronisation with AIF. Experiments show that, compared to standard neural transducers, the proposed LS-Transducer gave a 12.9% relative WER reduction (WERR) on intra-domain LibriSpeech data, as well as 21.4% and 24.6% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.

Pages: 3507-3516

Number of pages: 10
