Active Speaker Detection with Audio-Visual Co-training

Cited by: 12

Authors:

Chakravarty, Punarjay [1]

Zegers, Jeroen [2]

Tuytelaars, Tinne [1]

Van Hamme, Hugo [2]

Affiliations:

[1] Katholieke Univ Leuven, iMinds MMT, ESAT PSI, Leuven, Belgium

[2] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium

Source:

ICMI'16: PROCEEDINGS OF THE 18TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2016

Keywords:

Active Speaker Detection; Audio-visual Co-training; Recognition

DOI:

10.1145/2993148.2993172

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision: audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
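
The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the synthetic 2-D "voice" features, the 1-D lip-motion feature, the energy-threshold VAD, the nearest-centroid classifiers, and every name below (`nearest`, `co_train_demo`) are assumptions chosen only to make the four steps concrete and runnable.

```python
import numpy as np

def nearest(centroids, feats):
    """Index of the closest centroid for every row of feats."""
    d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def co_train_demo(n_frames=600, seed=0):
    rng = np.random.default_rng(seed)
    # Hidden ground truth per frame: 0 = silence, 1 or 2 = that person speaks.
    state = rng.integers(0, 3, size=n_frames)

    # Toy audio (assumption): each person has a distinct 2-D voice mean.
    voice_means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
    audio = voice_means[state] + rng.normal(0, 0.3, size=(n_frames, 2))
    # Toy video (assumption): one lip-motion value per person,
    # high only while that person is speaking.
    video = np.stack([(state == p).astype(float) for p in (1, 2)], axis=1)
    video += rng.normal(0, 0.1, size=video.shape)

    # Step 1: audio VAD (a plain energy threshold here) gives weak labels.
    vad = np.linalg.norm(audio, axis=1) > 1.0

    voice_models = []
    for p in range(2):
        feats = video[:, p:p + 1]
        # Step 2: VAD labels weakly supervise a per-person video classifier.
        # The labels are noisy: VAD also fires when the other person speaks.
        cents = np.stack([feats[~vad].mean(axis=0), feats[vad].mean(axis=0)])
        speaking = nearest(cents, feats) == 1
        # Step 3: video predictions supervise that person's voice model.
        voice_models.append(audio[speaking].mean(axis=0))

    # Step 4: the voice models alone pick the active speaker on speech frames.
    pred = np.zeros(n_frames, dtype=int)
    pred[vad] = nearest(np.stack(voice_models), audio[vad]) + 1
    return float((pred == state).mean())

print(f"frame accuracy: {co_train_demo():.3f}")
```

No manual labels enter the sketch: the hidden `state` array is used only to generate the synthetic data and to score the final prediction. The video classifier still separates speaking from non-speaking frames despite the noisy VAD labels because, in this toy setup, lip motion is high only for the person who is actually talking.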


Pages: 312-316

Page count: 5
