Active Speaker Detection with Audio-Visual Co-training

被引:12
|
作者
Chakravarty, Punarjay [1 ]
Zegers, Jeroen [2 ]
Tuytelaars, Tinne [1 ]
Van Hamme, Hugo [2 ]
机构
[1] Katholieke Univ Leuven, iMinds MMT, ESAT PSI, Leuven, Belgium
[2] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium
关键词
Active Speaker Detection; Audio-visual Co-training; RECOGNITION;
D O I
10.1145/2993148.2993172
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
引用
收藏
页码:312 / 316
页数:5
相关论文
共 50 条
  • [31] Audio-visual speaker identification based on the use of dynamic audio and visual features
    Fox, N
    Reilly, RB
    AUDIO-BASED AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 743 - 751
  • [32] Transfer of Audio-Visual Temporal Training to Temporal and Spatial Audio-Visual Tasks
    Suerig, Ralf
    Bottari, Davide
    Roeder, Brigitte
    MULTISENSORY RESEARCH, 2018, 31 (06) : 556 - 578
  • [33] A Visual Signal Reliability for Robust Audio-Visual Speaker Identification
    Tariquzzaman, Md.
    Kim, Jin Young
    Na, Seung You
    Kim, Hyoung-Gook
    Har, Dongsoo
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2011, E94D (10): : 2052 - 2055
  • [34] BEST OF BOTH WORLDS: MULTI-TASK AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION AND ACTIVE SPEAKER DETECTION
    Braga, Otavio
    Siohan, Olivier
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6047 - 6051
  • [35] Particle Filtering for Bearing-Only Audio-Visual Speaker Detection and Tracking
    Rae, Andrew
    Khamis, Alaa
    Basir, Otman
    Kamel, Mohamed
    2009 3RD INTERNATIONAL CONFERENCE ON SIGNALS, CIRCUITS AND SYSTEMS (SCS 2009), 2009, : 161 - +
  • [36] E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing
    Yu, Xiaojing
    Zhang, Lan
    Li, Xiang-yang
    2023 20TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING, SECON, 2023,
  • [37] Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection
    Choudhury, T
    Rehg, JM
    Pavlovic, V
    Pentland, A
    16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL III, PROCEEDINGS, 2002, : 789 - 794
  • [38] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
    Fecher, Natalie
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
  • [39] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [40] Speaker and digit recognition by audio-visual lip biometrics
    Faraj, Maycel Isaac
    Bigun, Josef
    ADVANCES IN BIOMETRICS, PROCEEDINGS, 2007, 4642 : 1016 - +