Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Cited by: 0
Authors
Yeh, Chia-Kai [1 ]
Chen, Chih-Chun [1 ]
Hsu, Ching-Hsieh [1 ]
Chien, Jen-Tzung [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan
Source
INTERSPEECH 2024
Keywords
speech recognition; diffusion model; feature decorrelation; fast sampling
DOI
10.21437/Interspeech.2024-1898
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The diffusion model excels as a generative model for continuous data within a single modality. To extend its effectiveness to speech recognition, where continuous speech frames serve as the condition for generating discrete word tokens, building a conditional diffusion model over a discrete state space becomes crucial. This paper introduces a non-autoregressive discrete diffusion model that enables parallel generation of the word string corresponding to a speech signal through iterative diffusion steps. An acoustic transformer encoder extracts the speech representation, which serves as the condition for a denoising transformer decoder to predict the whole discrete sequence. To reduce redundancy in cross-modality diffusion, an additional feature decorrelation objective is integrated into the optimization. This paper further reduces inference time by using a fast sampling approach. Experiments on speech recognition illustrate the merit of the proposed method.
Pages: 3924-3928
Page count: 5
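
To make the redundancy-reduction idea in the abstract concrete, the short PyTorch sketch below shows one common form of a feature decorrelation objective: penalizing the off-diagonal entries of the correlation matrix of the encoder's frame-level features. The function name, the loss weight, and the exact formulation are illustrative assumptions; the paper's own objective may differ in detail.

import torch

def feature_decorrelation_loss(features: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Penalize off-diagonal correlations between feature dimensions.

    features: (num_frames, feature_dim) encoder outputs; a hypothetical
    stand-in for the acoustic representation used as the diffusion condition.
    """
    n, d = features.shape
    # Standardize each dimension so the Gram matrix becomes a correlation matrix.
    z = (features - features.mean(dim=0)) / (features.std(dim=0) + eps)
    corr = (z.T @ z) / (n - 1)                       # (d, d) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))   # zero out the diagonal
    return (off_diag ** 2).sum() / d                 # mean squared off-diagonal term

# Hypothetical usage: add the decorrelation term to the diffusion denoising loss.
# total_loss = diffusion_loss + 0.01 * feature_decorrelation_loss(encoder_features)

In this reading, the decorrelation term discourages redundant dimensions in the conditioning representation, which is one plausible interpretation of the abstract's redundancy-reduction objective.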