The Conversation: Deep Audio-Visual Speech Enhancement

Cited by: 0
Authors
Afouras, Triantafyllos [1]
Chung, Joon Son [1]
Zisserman, Andrew [1]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
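Keywords
speech enhancement; speech separation
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract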
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
Pages: 3244-3248
Number of pages: 5