Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Cited by: 152
Authors
Hou, Jen-Cheng [1]
Wang, Syu-Siang [2]
Lai, Ying-Hui [3]
Tsao, Yu [1]
Chang, Hsiu-Wen [4]
Wang, Hsin-Min [5]
Affiliations
[1] Academia Sinica, Research Center for Information Technology Innovation, Taipei 11529, Taiwan
[2] National Taiwan University, Graduate Institute of Communication Engineering, Taipei 10617, Taiwan
[3] National Yang-Ming University, Department of Biomedical Engineering, Taipei 112, Taiwan
[4] Mackay Medical College, Department of Audiology and Speech-Language Pathology, New Taipei 252, Taiwan
[5] Academia Sinica, Institute of Information Science, Taipei 11529, Taiwan
Keywords
Audio-visual systems; deep convolutional neural networks; multimodal learning; speech enhancement; voice activity detection; noise reduction; source separation; denoising autoencoder; recognition; algorithms; intelligibility
DOI
10.1109/TETCI.2017.2784878
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on audio information. In this paper, inspired by multimodal learning, which utilizes data from different modalities, and by the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model that incorporates audio and visual streams into a unified network. We also propose a multitask learning framework for reconstructing audio and visual signals at the output layer. More precisely, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed by individual CNNs and then fused into a joint network that generates enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and its parameters are jointly learned through backpropagation. We evaluate the enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audio-visual SE model, confirming its capability to effectively combine audio and visual information in SE.
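The architecture described in the abstract maps naturally onto a two-stream network with a shared fusion trunk and two output heads. Below is a minimal PyTorch sketch of that structure; the layer counts, filter sizes, feature dimensions, image size, and loss weights are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVDCNNSketch(nn.Module):
    """Two-stream CNN with a fused trunk and two output heads, loosely
    following the AVDCNN description in the abstract. All dimensions
    here are assumed for illustration."""

    def __init__(self, n_freq_bins=257, img_size=32):
        super().__init__()
        self.img_size = img_size
        # Audio stream: 1-D convolutions over the frequency axis of a
        # noisy log-power spectral frame.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Visual stream: 2-D convolutions over a grayscale mouth-region image.
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Joint (fusion) network shared by both output tasks.
        fused_dim = 32 * n_freq_bins + 32 * 4 * 4
        self.fusion = nn.Sequential(nn.Linear(fused_dim, 512), nn.ReLU())
        # Primary task: enhanced speech frame; secondary task: reconstructed image.
        self.speech_head = nn.Linear(512, n_freq_bins)
        self.image_head = nn.Linear(512, img_size * img_size)

    def forward(self, noisy_spec, mouth_img):
        # noisy_spec: (B, n_freq_bins); mouth_img: (B, 1, img_size, img_size)
        a = self.audio_cnn(noisy_spec.unsqueeze(1)).flatten(1)
        v = self.visual_cnn(mouth_img).flatten(1)
        h = self.fusion(torch.cat([a, v], dim=1))  # fused audio-visual features
        enhanced = self.speech_head(h)
        recon = self.image_head(h).view(-1, 1, self.img_size, self.img_size)
        return enhanced, recon

def joint_loss(enhanced, clean_spec, recon, target_img, alpha=1.0, beta=0.1):
    # Multitask objective: weighted sum of the speech-enhancement error
    # (primary task) and the image-reconstruction error (secondary task).
    # The 0.1 weighting is an assumed value, not taken from the paper.
    return alpha * F.mse_loss(enhanced, clean_spec) + beta * F.mse_loss(recon, target_img)

# Usage sketch: one backpropagation step over a random batch.
model = AVDCNNSketch()
enhanced, recon = model(torch.randn(8, 257), torch.randn(8, 1, 32, 32))
loss = joint_loss(enhanced, torch.randn(8, 257), recon, torch.randn(8, 1, 32, 32))
loss.backward()
```

Under this reading, the image-reconstruction head plausibly acts as a regularizer: forcing the fused representation to retain visual detail encourages the network to exploit the visual stream rather than relying on audio features alone. This is one interpretation of the secondary task, not a claim made explicitly in the abstract.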
Pages: 117-128
Page count: 12
Related Papers
50 records in total
  • [1] Hou, Jen-Cheng; Wang, Syu-Siang; Lai, Ying-Hui; Lin, Jen-Chun; Tsao, Yu; Chang, Hsiu-Wen; Wang, Hsin-Min. Audio-Visual Speech Enhancement using Deep Neural Networks. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016.
  • [2] Paulin, Hebsibah; Milton, R. S.; JanakiRaman, S.; Chandraprabha, K. Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network. Journal of Testing and Evaluation, 2019, 47(6): 3963-3974.
  • [3] Zhang, Shiqing; Zhang, Shiliang; Huang, Tiejun; Gao, Wen. Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition. ICMR'16: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016: 281-284.
  • [4] Mroueh, Youssef; Marcheret, Etienne; Goel, Vaibhava. Deep Multimodal Learning for Audio-Visual Speech Recognition. 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015: 2130-2134.
  • [5] Marcheret, Etienne; Potamianos, Gerasimos; Vopicka, Josef; Goel, Vaibhava. Detecting Audio-Visual Synchrony Using Deep Neural Networks. 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Vols 1-5, 2015: 548-552.
  • [6] Lewis, T. W.; Powers, D. M. W. Audio-visual speech recognition using red exclusion and neural networks. Journal of Research and Practice in Information Technology, 2003, 35(1): 41-64.
  • [7] Meutzner, Hendrik; Ma, Ning; Nickel, Robert; Schymura, Christopher; Kolossa, Dorothea. Improving Audio-Visual Speech Recognition Using Deep Neural Networks with Dynamic Stream Reliability Estimates. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 5320-5324.
  • [8] Su, Rongfeng; Wang, Lan; Liu, Xunying. Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition. 2017 International Conference on Asian Language Processing (IALP), 2017: 40-43.
  • [9] Chuang, Shang-Yi; Tsao, Yu; Lo, Chen-Chou; Wang, Hsin-Min. Lite Audio-Visual Speech Enhancement. INTERSPEECH 2020, 2020: 1131-1135.
  • [10] Girin, L.; Schwartz, J. L.; Feng, G. Audio-visual enhancement of speech in noise. Journal of the Acoustical Society of America, 2001, 109(6): 3007-3020.