Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

Cited by: 19
Authors
Ban, Yutong [1]
Alameda-Pineda, Xavier [1]
Girin, Laurent [2]
Horaud, Radu [1]
Affiliations
[1] Inria Grenoble Rhone Alpes, Montbonnot St Martin, France
[2] Univ Grenoble Alpes, GIPSA Lab, F-38400 St Martin Dheres, France
Keywords
Visualization; Target tracking; Acoustics; Bayes methods; Cameras; Object tracking; Direction-of-arrival estimation; Audio-visual tracking; multiple object tracking; dynamic Bayesian networks; variational inference; expectation-maximization; speaker diarization; LOCALIZATION; DIARIZATION
DOI
10.1109/TPAMI.2019.2953020
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this article, we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature and roles of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status (either speaking or silent) of each tracked person over time. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This can be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We propose a variational inference model which amounts to approximating the joint distribution with a factorized distribution. The solution takes the form of a closed-form expectation-maximization procedure. We describe the inference algorithm in detail, evaluate its performance, and compare it with several baseline methods. These experiments show that the proposed audio-visual tracker performs well in informal meetings involving a time-varying number of people.
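The factorized approximation mentioned in the abstract can be illustrated with the generic mean-field form below. This is a sketch under assumed notation, not the paper's own formulation: the symbols $s_t$ (continuous latent variables, e.g. person positions), $z_t$ (discrete latent variables, e.g. observation-to-person assignments), and $o_{1:t}$ (observations up to time $t$) are introduced here for illustration only, and the abstract does not specify how the latent variables are actually grouped.

```latex
% Generic mean-field sketch; all symbols are illustrative assumptions.
\begin{align}
  p(s_t, z_t \mid o_{1:t}) &\approx q(s_t)\, q(z_t), \\
  \log q^\star(s_t) &= \mathbb{E}_{q(z_t)}\!\big[\log p(s_t, z_t, o_t \mid o_{1:t-1})\big] + \text{const}, \\
  \log q^\star(z_t) &= \mathbb{E}_{q(s_t)}\!\big[\log p(s_t, z_t, o_t \mid o_{1:t-1})\big] + \text{const}.
\end{align}
```

Alternating the last two updates until convergence is the standard coordinate-ascent scheme for a factorized posterior; when each expectation has a closed form, the result is an expectation-maximization-style procedure of the kind the abstract describes.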
Pages: 1761-1776
Page count: 16
Related papers
50 in total
  • [1] Audio-Visual Tracking of Concurrent Speakers
    Qian, Xinyuan
    Brutti, Alessio
    Lanz, Oswald
    Omologo, Maurizio
    Cavallaro, Andrea
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 942 - 954
  • [2] AUDIO-VISUAL TRACKING OF MULTIPLE SPEAKERS VIA A PMBM FILTER
    Zhao, Jinzheng
    Wu, Peipei
    Liu, Xubo
    Xu, Yong
    Mihaylova, Lyudmila
    Godsill, Simon
    Wang, Wenwu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 5068 - 5072
  • [3] Audio-Visual Data Fusion for Tracking the Direction of Multiple Speakers
    Quang Nguyen
    Choi, JongSuk
    INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS (ICCAS 2010), 2010, : 1626 - 1630
  • [4] Audio-Visual Tracking of a Variable Number of Speakers with a Random Finite Set Approach
    Kilic, Volkan
    Zhong, Xionghu
    Barnard, Mark
    Wang, Wenwu
    Kittler, Josef
    2014 17TH INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION), 2014,
  • [5] Audio-Visual Variational Fusion for Multi-Person Tracking with Robots
    Alameda-Pineda, Xavier
    Arias, Soraya
    Ban, Yutong
    Delorme, Guillaume
    Girin, Laurent
    Horaud, Radu
    Li, Xiaofei
    Mourgue, Bastien
    Sarrazin, Guillaume
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1059 - 1061
  • [6] AUDIO-VISUAL TRACKING BY DENSITY APPROXIMATION IN A SEQUENTIAL BAYESIAN FILTERING FRAMEWORK
    Gebru, Israel D.
    Evers, Christine
    Naylor, Patrick A.
    Horaud, Radu
    2017 HANDS-FREE SPEECH COMMUNICATIONS AND MICROPHONE ARRAYS (HSCMA 2017), 2017, : 71 - 75
  • [7] Adaptive monocular multiple object tracking with variational Bayesian inference
    Tian, Shixia
    INTERNATIONAL JOURNAL OF GENERAL SYSTEMS, 2025,
  • [8] Audio-visual tracking for natural interactivity
    Pingali, G
    Tunali, G
    Carlbom, I
    ACM MULTIMEDIA 99, PROCEEDINGS, 1999, : 373 - 382
  • [9] Visual Tracking with Sparse Prototypes: An Approach Based on Variational Bayesian Inference
    Hu, Lei
    Wang, Jun
    Wu, Zemin
    Zhang, Lei
    2018 IEEE 3RD INTERNATIONAL CONFERENCE ON IMAGE, VISION AND COMPUTING (ICIVC), 2018, : 560 - 565
  • [10] Variational inference for visual tracking
    Vermaak, J
    Lawrence, ND
    Pérez, P
    2003 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2003, : 773 - 780