Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

Cited by: 19
Authors
Ban, Yutong [1]
Alameda-Pineda, Xavier [1]
Girin, Laurent [2]
Horaud, Radu [1]
Affiliations
[1] Inria Grenoble Rhone Alpes, Montbonnot St Martin, France
[2] Univ Grenoble Alpes, GIPSA Lab, F-38400 St Martin Dheres, France
Keywords
Visualization; Target tracking; Acoustics; Bayes methods; Cameras; Object tracking; Direction-of-arrival estimation; Audio-visual tracking; multiple object tracking; dynamic Bayesian networks; variational inference; expectation-maximization; speaker diarization; LOCALIZATION; DIARIZATION
DOI
10.1109/TPAMI.2019.2953020
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this article, we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature and roles of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status (speaking or silent) of each tracked person over time. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This may well be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We propose a variational inference model that amounts to approximating the joint distribution with a factorized distribution. The solution takes the form of a closed-form expectation-maximization procedure. We describe the inference algorithm in detail, evaluate its performance, and compare it with several baseline methods. These experiments show that the proposed audio-visual tracker performs well in informal meetings involving a time-varying number of people.
Pages: 1761 - 1776
Number of pages: 16
Related papers
50 in total
  • [21] Multimodal tracking and classification of audio-visual features
    Pavlovic, V
    1998 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING - PROCEEDINGS, VOL 1, 1998, : 343 - 347
  • [22] Exploring the effectiveness of auditory, visual, and audio-visual sensory cues in a multiple object tracking environment
    Julia Föcker
    Polly Atkins
    Foivos-Christos Vantzos
    Maximilian Wilhelm
    Thomas Schenk
    Hauke S. Meyerhoff
    Attention, Perception, & Psychophysics, 2022, 84 : 1611 - 1624
  • [23] The Impact of Audio-Visual, Visual and Auditory Cues on Multiple Object Tracking Performance in Children with Autism
    Hughes, Lily
    Kargas, Niko
    Wilhelm, Maximilian
    Meyerhoff, Hauke S. S.
    Foecker, Julia
    PERCEPTUAL AND MOTOR SKILLS, 2023, 130 (05) : 2047 - 2068
  • [24] Exploring the effectiveness of auditory, visual, and audio-visual sensory cues in a multiple object tracking environment
    Foecker, Julia
    Atkins, Polly
    Vantzos, Foivos-Christos
    Wilhelm, Maximilian
    Schenk, Thomas
    Meyerhoff, Hauke S.
    ATTENTION PERCEPTION & PSYCHOPHYSICS, 2022, 84 (05) : 1611 - 1624
  • [25] Dynamic Bayesian Networks for audio-visual speaker recognition
    Li, DD
    Yang, YC
    Wu, ZH
    ADVANCES IN BIOMETRICS, PROCEEDINGS, 2006, 3832 : 539 - 545
  • [26] Dynamic Bayesian Networks for Audio-Visual Speech Recognition
    Ara V. Nefian
    Luhong Liang
    Xiaobo Pi
    Xiaoxing Liu
    Kevin Murphy
    EURASIP Journal on Advances in Signal Processing, 2002
  • [27] VARIATIONAL BAYESIAN INFERENCE FOR STEREO OBJECT TRACKING
    Chantas, Giannis
    Nikolaidis, Nikos
    Pitas, Ioannis
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 2439 - 2443
  • [28] Dynamic Bayesian networks for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Liu, XX
    Murphy, K
    EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1274 - 1288
  • [29] Joint Audio-Visual Tracking Using Particle Filters
    Dmitry N. Zotkin
    Ramani Duraiswami
    Larry S. Davis
    EURASIP Journal on Advances in Signal Processing, 2002
  • [30] Audio-Visual Detection of Multiple Chirping Robots
    Gribovskiy, Alexey
    Mondada, Francesco
    IAS-10: INTELLIGENT AUTONOMOUS SYSTEMS 10, 2008, : 324 - 331