Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Cited by: 0
Authors
Yeh, Chia-Kai [1 ]
Chen, Chih-Chun [1 ]
Hsu, Ching-Hsieh [1 ]
Chien, Jen-Tzung [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan
Source: INTERSPEECH 2024
Keywords
speech recognition; diffusion model; feature decorrelation; fast sampling
DOI
10.21437/Interspeech.2024-1898
CLC number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The diffusion model excels as a generative model for continuous data within a single modality. To extend its effectiveness to speech recognition, where continuous speech frames serve as the condition for generating discrete word tokens, building a conditional diffusion process over a discrete state space becomes crucial. This paper introduces a non-autoregressive discrete diffusion model that enables parallel generation of the word string corresponding to a speech signal through iterative diffusion steps. An acoustic transformer encoder extracts the speech representation, which conditions a denoising transformer decoder to predict the whole discrete sequence. To reduce redundancy in the cross-modality diffusion features, an additional feature decorrelation objective is integrated during optimization. Inference time is further reduced through a fast sampling approach. Experiments on speech recognition illustrate the merit of the proposed method.
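To make the described pipeline concrete, the following is a minimal sketch of a conditional discrete diffusion setup for ASR: an acoustic transformer encoder conditions a parallel denoising transformer decoder, an off-diagonal feature decorrelation penalty approximates the redundancy-reduction objective, and a confidence-based iterative refinement loop stands in for fast sampling. All module names, the mask-token corruption scheme, the 80-dim filterbank input, and the exact loss normalization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only; module names and the mask-based corruption
# scheme are assumptions, not the paper's released code.
import torch
import torch.nn as nn

class CrossModalDiffusionASR(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, mask_id=0):
        super().__init__()
        self.mask_id = mask_id
        self.embed = nn.Embedding(vocab_size, d_model)
        self.frame_proj = nn.Linear(80, d_model)  # assumed 80-dim fbank frames
        # Acoustic transformer encoder: maps speech frames to the condition.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Denoising transformer decoder: predicts all tokens in parallel.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech, noisy_tokens):
        cond = self.encoder(self.frame_proj(speech))      # (B, T_speech, d)
        h = self.decoder(self.embed(noisy_tokens), cond)  # (B, T_text, d)
        return self.out(h)                                # token logits

def decorrelation_loss(h):
    """Penalize off-diagonal feature correlations (redundancy reduction)."""
    z = h.reshape(-1, h.size(-1))
    z = (z - z.mean(0)) / (z.std(0) + 1e-6)
    corr = (z.T @ z) / z.size(0)
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).sum() / h.size(-1)

@torch.no_grad()
def fast_sample(model, speech, seq_len, steps=4):
    """Unmask all positions in a few parallel refinement steps."""
    B = speech.size(0)
    tokens = torch.full((B, seq_len), model.mask_id,
                        dtype=torch.long, device=speech.device)
    for s in range(steps):
        logits = model(speech, tokens)
        conf, pred = logits.softmax(-1).max(-1)
        # Keep the most confident predictions, re-mask the rest;
        # more positions are kept at every step.
        keep = max(1, int(seq_len * (s + 1) / steps))
        thresh = conf.sort(dim=-1, descending=True).values[:, keep - 1:keep]
        tokens = torch.where(conf >= thresh, pred,
                             torch.full_like(pred, model.mask_id))
    return tokens
```

Fewer refinement steps trade accuracy for latency, which is the essence of the fast sampling trade-off the abstract refers to.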
Pages: 3924-3928
Number of pages: 5