Separation of Audio-Visual Speech Sources: A New Approach Exploiting the Audio-Visual Coherence of Speech Stimuli

Cited by: 0
Authors
David Sodoyer
Jean-Luc Schwartz
Laurent Girin
Jacob Klinkisch
Christian Jutten
Affiliations
[1] Université Stendhal, Institut de la Communication Parlée, Institut National Polytechnique de Grenoble
Keywords
blind source separation; lipreading; audio-visual speech processing
DOI
Not available
Abstract
We present a new approach to the source separation problem in the case of multiple speech signals. The method is based on the use of automatic lipreading: the objective is to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker's lip movements. We consider the case of an additive stationary mixture of decorrelated sources, with no further assumptions on independence or non-Gaussian character. First, we present a theoretical framework showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system. Then we address the case of audio-visual sources. We show how, if a statistical model of the joint probability of visual and spectral audio input is learnt to quantify the audio-visual coherence, separation can be achieved by maximizing this probability. Finally, we present a number of separation results on a corpus of vowel-plosive-vowel sequences uttered by a single speaker, embedded in a mixture of other voices. We show that separation can be quite good for mixtures of 2, 3, and 5 sources. These results, while very preliminary, are encouraging, and are discussed with respect to their potential complementarity with traditional pure-audio separation or enhancement techniques.
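As a rough illustration of the principle summarized in the abstract (searching for demixing weights that maximize a learnt measure of audio-visual coherence), the Python sketch below shows one possible reading. The log band-energy spectral features, the Gaussian-mixture coherence model, the single-frame treatment, and every function name are assumptions made purely for illustration; they do not reproduce the authors' formulation or optimisation.

import numpy as np
from scipy.optimize import minimize

def spectral_envelope(signal, n_bands=20):
    # Crude log band-energy description of a candidate extracted signal
    # (hypothetical stand-in for the paper's spectral characteristics).
    power = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(power, n_bands)
    return np.log(np.array([b.mean() for b in bands]) + 1e-12)

def av_log_coherence(audio_feat, video_feat, gmm):
    # Log joint probability of the (video, audio) feature vector under a
    # pre-trained statistical model, e.g. a fitted sklearn GaussianMixture
    # exposing .score() (an assumption, not the paper's exact model).
    joint = np.concatenate([video_feat, audio_feat])[None, :]
    return gmm.score(joint)

def separate_one_source(mixtures, video_feat, gmm):
    # mixtures: (n_sensors, n_samples) additive stationary mixture.
    # video_feat: lip-shape parameters of the target speaker.
    n_sensors = mixtures.shape[0]

    def neg_coherence(w):
        candidate = w @ mixtures  # candidate extracted source
        return -av_log_coherence(spectral_envelope(candidate), video_feat, gmm)

    w0 = np.ones(n_sensors) / n_sensors
    res = minimize(neg_coherence, w0, method="Nelder-Mead")
    return res.x @ mixtures  # estimate of the target speech signal

In practice one would learn the joint audio-visual model beforehand on clean audio-visual speech and would typically process the signal frame by frame, accumulating log-probabilities over frames rather than scoring a single global spectrum.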
Related Papers
50 items in total
  • [31] Audio-visual enhancement of speech in noise
    Girin, L
    Schwartz, JL
    Feng, G
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2001, 109(6): 3007-3020
  • [32] Audio-visual speech processing and attention
    Sams, M
    PSYCHOPHYSIOLOGY, 2003, 40: S5-S6
  • [33] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013: 484-487
  • [34] Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training
    Zhang, Peng
    Xu, Jiaming
    Shi, Jing
    Hao, Yunzhe
    Qin, Lei
    Xu, Bo
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [35] Audio-Visual Speech Modeling for Continuous Speech Recognition
    Dupont, Stephane
    Luettin, Juergen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2(3): 141-151
  • [36] Audio-visual speech perception without speech cues
    Saldana, HM
    Pisoni, DB
    Fellowes, JM
    Remez, RE
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996: 2187-2190
  • [38] Perceptual congruency of audio-visual speech affects ventriloquism with bilateral visual stimuli
    Kanaya, Shoko
    Yokosawa, Kazuhiko
    PSYCHONOMIC BULLETIN & REVIEW, 2011, 18(1): 123-128
  • [39] Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars
    Ivanko, Denis
    Ryumin, Dmitry
    Axyonov, Alexandr
    Kashevnik, Alexey
    Karpov, Alexey
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022: 1555-1559
  • [40] FaceFilter: Audio-visual speech separation using still images
    Chung, Soo-Whan
    Choe, Soyeon
    Chung, Joon Son
    Kang, Hong-Goo
    INTERSPEECH 2020, 2020: 3481-3485