Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Cited by: 0
Authors
Yeh, Chia-Kai [1]
Chen, Chih-Chun [1]
Hsu, Ching-Hsieh [1]
Chien, Jen-Tzung [1]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan
Source
Keywords
speech recognition; diffusion model; feature decorrelation; fast sampling
DOI
10.21437/Interspeech.2024-1898
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The diffusion model excels as a generative model for continuous data within a single modality. Extending it to speech recognition, where continuous speech frames serve as the condition for generating discrete word tokens, requires a conditional diffusion process over a discrete state space. This paper introduces a non-autoregressive discrete diffusion model that generates the word string corresponding to a speech signal in parallel through iterative diffusion steps. An acoustic transformer encoder extracts the speech representation, which conditions a denoising transformer decoder that predicts the whole discrete sequence. To reduce redundancy in cross-modality diffusion, an additional feature decorrelation objective is integrated during optimization. A fast sampling approach further reduces the inference time. Experiments on speech recognition illustrate the merit of the proposed method.
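The abstract does not say which corruption process or sampler the paper uses, so the following is only a rough sketch of how non-autoregressive discrete diffusion with a reduced number of sampling steps might look. It assumes an absorbing-state (masking) forward process and confidence-based parallel unmasking. The names `MASK`, `corrupt`, `sample`, and `make_toy_denoiser` are hypothetical, and the toy denoiser merely stands in for the speech-conditioned denoising transformer decoder.

```python
import random

MASK = "<mask>"  # hypothetical absorbing token; not named in the abstract


def corrupt(tokens, t, num_steps, rng):
    """Assumed forward process: mask each token independently
    with probability t / num_steps."""
    p = t / num_steps
    return [MASK if rng.random() < p else tok for tok in tokens]


def sample(denoise, length, num_steps):
    """Assumed reverse process: start fully masked, predict every position
    in parallel at each step, and commit the most confident predictions."""
    seq = [MASK] * length
    for step in range(1, num_steps + 1):
        # One (token, confidence) pair per position, predicted in parallel.
        preds = denoise(seq)
        # Number of positions that should be unmasked once this step ends.
        target = (length * step) // num_steps
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        n_reveal = target - (length - len(masked))
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            seq[i] = preds[i][0]
    return seq


def make_toy_denoiser(reference):
    """Stand-in for the conditioned decoder: always predicts the reference
    token, with an arbitrary confidence that decreases with position."""
    def denoise(seq):
        return [(tok, 1.0 / (i + 1)) for i, tok in enumerate(reference)]
    return denoise


reference = ["the", "cat", "sat", "down"]
rng = random.Random(0)
fully_masked = corrupt(reference, 4, 4, rng)  # t = num_steps masks everything
untouched = corrupt(reference, 0, 4, rng)     # t = 0 masks nothing
decoded_fast = sample(make_toy_denoiser(reference), len(reference), num_steps=2)
decoded_slow = sample(make_toy_denoiser(reference), len(reference), num_steps=4)
```

In this sketch, a smaller `num_steps` commits more tokens per iteration, which is one simple way to trade decoder passes for speed. The paper's actual fast sampling approach may differ.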
Pages: 3924-3928
Number of pages: 5