Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Authors
Yeh, Chia-Kai [1]
Chen, Chih-Chun [1]
Hsu, Ching-Hsieh [1]
Chien, Jen-Tzung [1]
Affiliation
[1] Natl Yang Ming Chiao Tung Univ, Inst Elect & Comp Engn, Hsinchu, Taiwan
Source
Interspeech 2024
Keywords
speech recognition; diffusion model; feature decorrelation; fast sampling
DOI
10.21437/Interspeech.2024-1898
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The diffusion model excels as a generative model for continuous data within a single modality. To extend its effectiveness to speech recognition, where the continuous speech frames are used as the condition to generate the discrete word tokens, building a conditional diffusion across discrete state space becomes crucial. This paper introduces a non-autoregressive discrete diffusion model, enabling parallel generation of a word string corresponding to a speech signal through iterative diffusion steps. An acoustic transformer encoder identifies the speech representation, serving as the condition for a denoising transformer decoder to predict the whole discrete sequence. To address the redundancy reduction in cross-modality diffusion, an additional feature decorrelation objective is integrated during optimization. This paper further reduces the inference time by using a fast sampling approach. The experiments on speech recognition illustrate the merit of the proposed method.
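The abstract mentions a feature decorrelation objective but does not give its form. As a rough illustration of the general idea only, the sketch below penalizes off-diagonal entries of a feature correlation matrix, which is one common way to discourage redundant feature dimensions. The function name `decorrelation_penalty`, the normalization, and the choice of a squared off-diagonal penalty are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def decorrelation_penalty(features: np.ndarray) -> float:
    """Mean squared off-diagonal entry of the feature correlation matrix.

    Hypothetical illustration, not the paper's objective.
    features: array of shape (num_samples, num_dims), with num_dims >= 2.
    """
    # Standardize each feature dimension (zero mean, unit variance).
    centered = features - features.mean(axis=0, keepdims=True)
    std = centered.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    normalized = centered / std

    # Empirical correlation matrix of shape (num_dims, num_dims).
    corr = normalized.T @ normalized / features.shape[0]

    # Zero the diagonal so only cross-dimension correlations are penalized.
    off_diag = corr - np.diag(np.diag(corr))
    num_dims = features.shape[1]
    return float((off_diag ** 2).sum() / (num_dims * (num_dims - 1)))


# Two identical dimensions are fully redundant; two orthogonal ones are not.
redundant = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
orthogonal = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
penalty_redundant = decorrelation_penalty(redundant)
penalty_orthogonal = decorrelation_penalty(orthogonal)
```

In a training setup, a term of this kind would typically be added to the main loss with a small weight, so that the penalty approaches zero as feature dimensions become uncorrelated.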
Pages: 3924-3928
Number of pages: 5