Mixture of Inference Networks for VAE-Based Audio-Visual Speech Enhancement

Cited by: 8
Authors:
Sadeghi, Mostafa [1]
Alameda-Pineda, Xavier [2]
Affiliations:
[1] Inria Res Ctr Nancy Grand Est, Multispeech Team, F-54600 Villers Les Nancy, France
[2] Inria Ctr Rech Grenoble Rhone Alpes, Percept Team, F-38334 Montbonnot St Martin, France
Funding:
European Union Horizon 2020
Keywords:
Speech enhancement; Visualization; Spectrogram; Decoding; Noise measurement; Neural networks; Data models; Audio-visual speech enhancement; generative models; variational auto-encoder; mixture model; FACTORIZATION
DOI:
10.1109/TSP.2021.3066038
Chinese Library Classification (CLC):
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline codes:
0808; 0809
Abstract
We address unsupervised audio-visual speech enhancement based on variational autoencoders (VAEs), where the prior distribution of the clean speech spectrogram is modeled with an encoder-decoder architecture. At enhancement (test) time, the trained generative model (decoder) is combined with a noise model whose parameters need to be estimated. Initializing the latent variables that describe the generative process of the clean speech via the decoder is crucial, as the overall inference problem is non-convex. This is usually done by feeding the noisy audio and clean visual data to the trained encoder and using its output. Current audio-visual VAE models do not provide an effective initialization because the two modalities are tightly coupled (concatenated) in the associated architectures. To overcome this issue, we introduce the mixture of inference networks variational autoencoder (MIN-VAE). Two encoder networks take audio and visual data as input, respectively, and the posterior of the latent variables is modeled as a mixture of two Gaussian distributions, one output by each encoder. The mixture variable is also latent, so learning the optimal balance between the audio and visual encoders is unsupervised as well. By training a shared decoder, the overall network learns to adaptively fuse the two modalities. Moreover, at test time the visual encoder, which takes (clean) visual data, is used for initialization. A variational inference approach is derived to train the proposed model. Thanks to the novel inference procedure and the robust initialization, the MIN-VAE outperforms the standard audio-only as well as audio-visual counterparts.
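To make the inference structure concrete, below is a minimal PyTorch sketch of the mixture-of-inference-networks idea described in the abstract: two modality-specific encoders whose Gaussian outputs form a two-component mixture posterior, roughly q(z | a, v) = α·N(z; μ_a(a), σ_a²(a)) + (1 − α)·N(z; μ_v(v), σ_v²(v)), feeding one shared decoder. The class name, all layer sizes, and the single global mixture weight are illustrative assumptions, not the authors' implementation; in particular, the paper treats the mixture variable as latent and infers it variationally, and training uses the derived variational bound, which is not shown here.

import torch
import torch.nn as nn

class MINVAESketch(nn.Module):
    # Hypothetical dimensions: one STFT magnitude frame (audio) and a
    # flattened lip-region embedding (visual); not the paper's configuration.
    def __init__(self, audio_dim=513, visual_dim=4489, latent_dim=32, hidden=128):
        super().__init__()
        # Audio inference network: q(z | audio) = N(mu_a, diag(exp(logvar_a))).
        self.enc_a = nn.Sequential(nn.Linear(audio_dim, hidden), nn.Tanh())
        self.mu_a = nn.Linear(hidden, latent_dim)
        self.logvar_a = nn.Linear(hidden, latent_dim)
        # Visual inference network: q(z | visual) = N(mu_v, diag(exp(logvar_v))).
        self.enc_v = nn.Sequential(nn.Linear(visual_dim, hidden), nn.Tanh())
        self.mu_v = nn.Linear(hidden, latent_dim)
        self.logvar_v = nn.Linear(hidden, latent_dim)
        # Global mixture logit standing in for the latent mixture variable,
        # which the paper infers variationally rather than fixing as a parameter.
        self.mix_logit = nn.Parameter(torch.zeros(1))
        # Shared decoder: maps z to the (nonnegative) variance of the
        # clean-speech spectrogram; sharing it couples the two encoders.
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, audio_dim), nn.Softplus())

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, audio, visual):
        ha, hv = self.enc_a(audio), self.enc_v(visual)
        mu_a, lv_a = self.mu_a(ha), self.logvar_a(ha)
        mu_v, lv_v = self.mu_v(hv), self.logvar_v(hv)
        # One latent sample per mixture component.
        z_a = self.reparameterize(mu_a, lv_a)
        z_v = self.reparameterize(mu_v, lv_v)
        alpha = torch.sigmoid(self.mix_logit)  # weight of the audio component
        # Reconstruction as a convex combination of the two decodings.
        recon = alpha * self.dec(z_a) + (1.0 - alpha) * self.dec(z_v)
        return recon, (mu_a, lv_a), (mu_v, lv_v), alpha

Under this sketch, the initialization strategy of the abstract amounts to querying only the visual branch at test time, e.g. z0 = model.mu_v(model.enc_v(visual)), so that the latent variables are initialized from clean visual data alone rather than from the noisy audio.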
Pages: 1899-1909 (11 pages)