Active Speaker Detection with Audio-Visual Co-training

Cited by: 12

Authors:

Chakravarty, Punarjay [1]

Zegers, Jeroen [2]

Tuytelaars, Tinne [1]

Van Hamme, Hugo [2]

Affiliations:

[1] Katholieke Univ Leuven, iMinds MMT, ESAT PSI, Leuven, Belgium

[2] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium

Source:

ICMI'16: PROCEEDINGS OF THE 18TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2016

Keywords:

Active Speaker Detection; Audio-visual Co-training; RECOGNITION;

DOI:

10.1145/2993148.2993172

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
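
The co-training loop described in the abstract can be sketched on toy data. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the features or classifiers, so the scalar features (audio energy, pitch, mouth motion), the threshold and nearest-mean classifiers, the made-up numbers, and every function name below are assumptions.

```python
from statistics import mean

# Toy frames: (audio_energy, pitch_hz, {person: mouth_motion}).
# All values are invented for illustration only.
FRAMES = [
    (0.05, 0.0,   {"A": 0.10, "B": 0.12}),  # silence
    (0.90, 118.0, {"A": 0.95, "B": 0.08}),  # A speaking
    (0.85, 122.0, {"A": 1.05, "B": 0.15}),  # A speaking
    (0.04, 0.0,   {"A": 0.08, "B": 0.10}),  # silence
    (0.80, 212.0, {"A": 0.12, "B": 0.90}),  # B speaking
    (0.88, 208.0, {"A": 0.09, "B": 1.10}),  # B speaking
    (0.06, 0.0,   {"A": 0.11, "B": 0.09}),  # silence
    (0.92, 120.0, {"A": 1.00, "B": 0.11}),  # A speaking
]


def vad(energy, threshold=0.5):
    """Hypothetical energy-based voice activity detector."""
    return energy > threshold


def train_video_classifiers(frames):
    """Step 1: audio VAD weakly supervises one video classifier per person.

    VAD only says that *someone* speaks, so the labels are weak. The
    per-person threshold is the midpoint between mean mouth motion in
    speech frames and in silent frames.
    """
    speech = [vad(energy) for energy, _, _ in frames]
    thresholds = {}
    for person in frames[0][2]:
        on = [m[person] for (_, _, m), s in zip(frames, speech) if s]
        off = [m[person] for (_, _, m), s in zip(frames, speech) if not s]
        thresholds[person] = (mean(on) + mean(off)) / 2
    return thresholds


def train_voice_models(frames, thresholds):
    """Step 2: the video classifier supervises a voice model per person.

    Each voice model is the mean pitch over the speech frames in which
    the video classifier says that person's mouth is moving.
    """
    models = {}
    for person, thr in thresholds.items():
        pitches = [pitch for energy, pitch, motion in frames
                   if vad(energy) and motion[person] > thr]
        models[person] = mean(pitches)
    return models


def detect_active_speaker(energy, pitch, models):
    """Step 3: audio-only detection using the personalized voice models."""
    if not vad(energy):
        return None
    return min(models, key=lambda person: abs(models[person] - pitch))


thresholds = train_video_classifiers(FRAMES)
voice_models = train_voice_models(FRAMES, thresholds)
labels = [detect_active_speaker(e, p, voice_models) for e, p, _ in FRAMES]
print(voice_models)
print(labels)
```

No manual labels enter the loop: the energy threshold produces weak labels for the video stage, and the video stage produces the labels for the voice models. A real system would replace each scalar stand-in with learned features and classifiers.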

Pages: 312-316

Number of pages: 5
