Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Cited by: 152
Authors
Hou, Jen-Cheng [1]
Wang, Syu-Siang [2]
Lai, Ying-Hui [3]
Tsao, Yu [1]
Chang, Hsiu-Wen [4]
Wang, Hsin-Min [5]
Affiliations
[1] Academia Sinica, Research Center for Information Technology Innovation, Taipei 11529, Taiwan
[2] National Taiwan University, Graduate Institute of Communication Engineering, Taipei 10617, Taiwan
[3] National Yang-Ming University, Department of Biomedical Engineering, Taipei 112, Taiwan
[4] Mackay Medical College, Department of Audiology and Speech-Language Pathology, New Taipei 252, Taiwan
[5] Academia Sinica, Institute of Information Science, Taipei 11529, Taiwan
Keywords
Audio-visual systems; deep convolutional neural networks; multimodal learning; speech enhancement; voice activity detection; noise reduction; source separation; denoising autoencoder; recognition; algorithms; intelligibility
DOI
10.1109/TETCI.2017.2784878
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on audio information. In this paper, inspired by multimodal learning, which utilizes data from different modalities, and by the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model that incorporates audio and visual streams into a unified network. We also propose a multitask learning framework for reconstructing audio and visual signals at the output layer. More precisely, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed by individual CNNs and then fused into a joint network that generates enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and its parameters are jointly learned through backpropagation. We evaluate the enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audio-visual SE model, confirming its capability to effectively combine audio and visual information in SE.
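The architecture described in the abstract maps naturally onto a two-stream network with a shared fusion trunk and two output heads. Below is a minimal PyTorch sketch of that structure; the layer counts, filter sizes, feature dimensions, image size, and loss weights are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVDCNNSketch(nn.Module):
    """Two-stream CNN with a fused trunk and two output heads, loosely
    following the AVDCNN description in the abstract. All dimensions
    here are assumed for illustration."""

    def __init__(self, n_freq_bins=257, img_size=32):
        super().__init__()
        self.img_size = img_size
        # Audio stream: 1-D convolutions over the frequency axis of a
        # noisy log-power spectral frame.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Visual stream: 2-D convolutions over a grayscale mouth-region image.
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Joint (fusion) network shared by both output tasks.
        fused_dim = 32 * n_freq_bins + 32 * 4 * 4
        self.fusion = nn.Sequential(nn.Linear(fused_dim, 512), nn.ReLU())
        # Primary task: enhanced speech frame; secondary task: reconstructed image.
        self.speech_head = nn.Linear(512, n_freq_bins)
        self.image_head = nn.Linear(512, img_size * img_size)

    def forward(self, noisy_spec, mouth_img):
        # noisy_spec: (B, n_freq_bins); mouth_img: (B, 1, img_size, img_size)
        a = self.audio_cnn(noisy_spec.unsqueeze(1)).flatten(1)
        v = self.visual_cnn(mouth_img).flatten(1)
        h = self.fusion(torch.cat([a, v], dim=1))  # fused audio-visual features
        enhanced = self.speech_head(h)
        recon = self.image_head(h).view(-1, 1, self.img_size, self.img_size)
        return enhanced, recon

def joint_loss(enhanced, clean_spec, recon, target_img, alpha=1.0, beta=0.1):
    # Multitask objective: weighted sum of the speech-enhancement error
    # (primary task) and the image-reconstruction error (secondary task).
    # The 0.1 weighting is an assumed value, not taken from the paper.
    return alpha * F.mse_loss(enhanced, clean_spec) + beta * F.mse_loss(recon, target_img)

# Usage sketch: one backpropagation step over a random batch.
model = AVDCNNSketch()
enhanced, recon = model(torch.randn(8, 257), torch.randn(8, 1, 32, 32))
loss = joint_loss(enhanced, torch.randn(8, 257), recon, torch.randn(8, 1, 32, 32))
loss.backward()
```

Under this reading, the image-reconstruction head plausibly acts as a regularizer: forcing the fused representation to retain visual detail encourages the network to exploit the visual stream rather than relying on audio features alone. This is one interpretation of the secondary task, not a claim made explicitly in the abstract.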
Pages: 117-128
Page count: 12
Related Papers
50 records in total
  • [1] Hou, Jen-Cheng; Wang, Syu-Siang; Lai, Ying-Hui; Lin, Jen-Chun; Tsao, Yu; Chang, Hsiu-Wen; Wang, Hsin-Min. Audio-Visual Speech Enhancement using Deep Neural Networks. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016.
  • [2] Paulin, Hebsibah; Milton, R. S.; JanakiRaman, S.; Chandraprabha, K. Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network. Journal of Testing and Evaluation, 2019, 47(6): 3963-3974.
  • [3] Zhang, Shiqing; Zhang, Shiliang; Huang, Tiejun; Gao, Wen. Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition. ICMR'16: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016: 281-284.
  • [4] Mroueh, Youssef; Marcheret, Etienne; Goel, Vaibhava. Deep Multimodal Learning for Audio-Visual Speech Recognition. 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015: 2130-2134.
  • [5] Marcheret, Etienne; Potamianos, Gerasimos; Vopicka, Josef; Goel, Vaibhava. Detecting Audio-Visual Synchrony Using Deep Neural Networks. 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Vols 1-5, 2015: 548-552.
  • [6] Lewis, T. W.; Powers, D. M. W. Audio-visual speech recognition using red exclusion and neural networks. Journal of Research and Practice in Information Technology, 2003, 35(1): 41-64.
  • [7] Meutzner, Hendrik; Ma, Ning; Nickel, Robert; Schymura, Christopher; Kolossa, Dorothea. Improving Audio-Visual Speech Recognition Using Deep Neural Networks with Dynamic Stream Reliability Estimates. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 5320-5324.
  • [8] Su, Rongfeng; Wang, Lan; Liu, Xunying. Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition. 2017 International Conference on Asian Language Processing (IALP), 2017: 40-43.
  • [9] Chuang, Shang-Yi; Tsao, Yu; Lo, Chen-Chou; Wang, Hsin-Min. Lite Audio-Visual Speech Enhancement. INTERSPEECH 2020, 2020: 1131-1135.
  • [10] Girin, L.; Schwartz, J. L.; Feng, G. Audio-visual enhancement of speech in noise. Journal of the Acoustical Society of America, 2001, 109(6): 3007-3020.