Mixture of Inference Networks for VAE-Based Audio-Visual Speech Enhancement

Cited by: 8
Authors:
Sadeghi, Mostafa [1]
Alameda-Pineda, Xavier [2]
Affiliations:
[1] Inria Res Ctr Nancy Grand Est, Multispeech Team, F-54600 Villers Les Nancy, France
[2] Inria Ctr Rech Grenoble Rhone Alpes, Percept Team, F-38334 Montbonnot St Martin, France
Funding:
European Union Horizon 2020
Keywords:
Speech enhancement; Visualization; Spectrogram; Decoding; Noise measurement; Neural networks; Data models; Audio-visual speech enhancement; generative models; variational auto-encoder; mixture model; FACTORIZATION
DOI:
10.1109/TSP.2021.3066038
Chinese Library Classification (CLC):
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline codes:
0808; 0809
Abstract
We address unsupervised audio-visual speech enhancement based on variational autoencoders (VAEs), where the prior distribution of the clean speech spectrogram is modeled with an encoder-decoder architecture. At enhancement (test) time, the trained generative model (decoder) is combined with a noise model whose parameters need to be estimated. Initializing the latent variables that describe the generative process of the clean speech via the decoder is crucial, as the overall inference problem is non-convex. This is usually done by feeding the noisy audio and clean visual data to the trained encoder and using its output. Current audio-visual VAE models do not provide an effective initialization because the two modalities are tightly coupled (concatenated) in the associated architectures. To overcome this issue, we introduce the mixture of inference networks variational autoencoder (MIN-VAE). Two encoder networks take audio and visual data as input, respectively, and the posterior of the latent variables is modeled as a mixture of two Gaussian distributions, one output by each encoder. The mixture variable is also latent, so learning the optimal balance between the audio and visual encoders is unsupervised as well. By training a shared decoder, the overall network learns to adaptively fuse the two modalities. Moreover, at test time the visual encoder, which takes (clean) visual data, is used for initialization. A variational inference approach is derived to train the proposed model. Thanks to the novel inference procedure and the robust initialization, the MIN-VAE outperforms the standard audio-only as well as audio-visual counterparts.
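To make the inference structure concrete, below is a minimal PyTorch sketch of the mixture-of-inference-networks idea described in the abstract: two modality-specific encoders whose Gaussian outputs form a two-component mixture posterior, roughly q(z | a, v) = α·N(z; μ_a(a), σ_a²(a)) + (1 − α)·N(z; μ_v(v), σ_v²(v)), feeding one shared decoder. The class name, all layer sizes, and the single global mixture weight are illustrative assumptions, not the authors' implementation; in particular, the paper treats the mixture variable as latent and infers it variationally, and training uses the derived variational bound, which is not shown here.

import torch
import torch.nn as nn

class MINVAESketch(nn.Module):
    # Hypothetical dimensions: one STFT magnitude frame (audio) and a
    # flattened lip-region embedding (visual); not the paper's configuration.
    def __init__(self, audio_dim=513, visual_dim=4489, latent_dim=32, hidden=128):
        super().__init__()
        # Audio inference network: q(z | audio) = N(mu_a, diag(exp(logvar_a))).
        self.enc_a = nn.Sequential(nn.Linear(audio_dim, hidden), nn.Tanh())
        self.mu_a = nn.Linear(hidden, latent_dim)
        self.logvar_a = nn.Linear(hidden, latent_dim)
        # Visual inference network: q(z | visual) = N(mu_v, diag(exp(logvar_v))).
        self.enc_v = nn.Sequential(nn.Linear(visual_dim, hidden), nn.Tanh())
        self.mu_v = nn.Linear(hidden, latent_dim)
        self.logvar_v = nn.Linear(hidden, latent_dim)
        # Global mixture logit standing in for the latent mixture variable,
        # which the paper infers variationally rather than fixing as a parameter.
        self.mix_logit = nn.Parameter(torch.zeros(1))
        # Shared decoder: maps z to the (nonnegative) variance of the
        # clean-speech spectrogram; sharing it couples the two encoders.
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, audio_dim), nn.Softplus())

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, audio, visual):
        ha, hv = self.enc_a(audio), self.enc_v(visual)
        mu_a, lv_a = self.mu_a(ha), self.logvar_a(ha)
        mu_v, lv_v = self.mu_v(hv), self.logvar_v(hv)
        # One latent sample per mixture component.
        z_a = self.reparameterize(mu_a, lv_a)
        z_v = self.reparameterize(mu_v, lv_v)
        alpha = torch.sigmoid(self.mix_logit)  # weight of the audio component
        # Reconstruction as a convex combination of the two decodings.
        recon = alpha * self.dec(z_a) + (1.0 - alpha) * self.dec(z_v)
        return recon, (mu_a, lv_a), (mu_v, lv_v), alpha

Under this sketch, the initialization strategy of the abstract amounts to querying only the visual branch at test time, e.g. z0 = model.mu_v(model.enc_v(visual)), so that the latent variables are initialized from clean visual data alone rather than from the noisy audio.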
Pages: 1899-1909 (11 pages)