Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Cited by: 0
Authors
Yeh, Chia-Kai [1 ]
Chen, Chih-Chun [1 ]
Hsu, Ching-Hsieh [1 ]
Chien, Jen-Tzung [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan
Source
INTERSPEECH 2024
Keywords
speech recognition; diffusion model; feature decorrelation; fast sampling
DOI
10.21437/Interspeech.2024-1898
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The diffusion model excels as a generative model for continuous data within a single modality. To extend its effectiveness to speech recognition, where continuous speech frames serve as the condition for generating discrete word tokens, building a conditional diffusion model over a discrete state space becomes crucial. This paper introduces a non-autoregressive discrete diffusion model that enables parallel generation of the word string corresponding to a speech signal through iterative diffusion steps. An acoustic transformer encoder extracts the speech representation, which serves as the condition for a denoising transformer decoder to predict the whole discrete sequence. To reduce redundancy in cross-modality diffusion, an additional feature decorrelation objective is integrated into the optimization. This paper further reduces inference time by using a fast sampling approach. Experiments on speech recognition illustrate the merit of the proposed method.
Pages: 3924-3928
Page count: 5
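
To make the redundancy-reduction idea in the abstract concrete, the short PyTorch sketch below shows one common form of a feature decorrelation objective: penalizing the off-diagonal entries of the correlation matrix of the encoder's frame-level features. The function name, the loss weight, and the exact formulation are illustrative assumptions; the paper's own objective may differ in detail.

import torch

def feature_decorrelation_loss(features: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Penalize off-diagonal correlations between feature dimensions.

    features: (num_frames, feature_dim) encoder outputs; a hypothetical
    stand-in for the acoustic representation used as the diffusion condition.
    """
    n, d = features.shape
    # Standardize each dimension so the Gram matrix becomes a correlation matrix.
    z = (features - features.mean(dim=0)) / (features.std(dim=0) + eps)
    corr = (z.T @ z) / (n - 1)                       # (d, d) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))   # zero out the diagonal
    return (off_diag ** 2).sum() / d                 # mean squared off-diagonal term

# Hypothetical usage: add the decorrelation term to the diffusion denoising loss.
# total_loss = diffusion_loss + 0.01 * feature_decorrelation_loss(encoder_features)

In this reading, the decorrelation term discourages redundant dimensions in the conditioning representation, which is one plausible interpretation of the abstract's redundancy-reduction objective.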