Active Speaker Detection with Audio-Visual Co-training

Cited by: 12

Authors:

Chakravarty, Punarjay [1]

Zegers, Jeroen [2]

Tuytelaars, Tinne [1]

Van Hamme, Hugo [2]

Affiliations:

[1] Katholieke Univ Leuven, iMinds MMT, ESAT PSI, Leuven, Belgium

[2] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium

Source:

ICMI'16: PROCEEDINGS OF THE 18TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2016

Keywords:

Active Speaker Detection; Audio-visual Co-training; RECOGNITION

DOI:

10.1145/2993148.2993172

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
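The three stages described in the abstract can be sketched as a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the features or classifiers used, so the scalar "mouth motion" and "audio" features, the midpoint-threshold video classifier, the nearest-mean voice model, and all function and variable names below are hypothetical stand-ins.

```python
from statistics import mean

# Toy frames (hypothetical data): a VAD flag, one scalar audio feature,
# and one scalar mouth-motion feature per visible person.
frames = [
    {"vad": 1, "audio": 100.0, "motion": {"A": 0.9, "B": 0.1}},
    {"vad": 1, "audio": 104.0, "motion": {"A": 0.8, "B": 0.2}},
    {"vad": 1, "audio": 200.0, "motion": {"A": 0.1, "B": 0.9}},
    {"vad": 1, "audio": 196.0, "motion": {"A": 0.2, "B": 0.8}},
    {"vad": 0, "audio": 0.0, "motion": {"A": 0.1, "B": 0.1}},
    {"vad": 0, "audio": 0.0, "motion": {"A": 0.2, "B": 0.1}},
]

def train_video_classifier(person):
    """Stage 1: VAD output acts as noisy (weak) labels for a per-person
    video classifier. Assumed model: threshold at the class-mean midpoint."""
    pos = [f["motion"][person] for f in frames if f["vad"]]
    neg = [f["motion"][person] for f in frames if not f["vad"]]
    threshold = (mean(pos) + mean(neg)) / 2.0
    return lambda motion: motion > threshold

def train_voice_model(person, video_clf):
    """Stage 2: frames the video classifier marks as 'speaking' are used
    to fit a voice model. Assumed model: the mean audio feature."""
    voiced = [f["audio"] for f in frames
              if f["vad"] and video_clf(f["motion"][person])]
    return mean(voiced)

def detect_active_speaker(audio, voice_models):
    """Stage 3: assign an audio frame to the person with the nearest voice model."""
    return min(voice_models, key=lambda p: abs(audio - voice_models[p]))

people = ("A", "B")
video_clfs = {p: train_video_classifier(p) for p in people}
voice_models = {p: train_voice_model(p, video_clfs[p]) for p in people}
```

Note that the weak labels in stage 1 are deliberately noisy: every visible person is labeled "speaking" whenever the VAD fires, yet the per-person threshold still separates high mouth motion from low. No manual labels enter the loop at any stage, which is the point the abstract makes.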


Pages: 312-316

Page count: 5

50 items in total

[21] A Bayesian approach to audio-visual speaker identification
Nefian, AV
Liang, LH
Fu, TY
Liu, XX
AUDIO-BASED AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 761 - 769
[22] Deep Audio-Visual Beamforming for Speaker Localization
Qian, Xinyuan
Zhang, Qiquan
Guan, Guohui
Xue, Wei
IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1132 - 1136
[23] Speaker independent audio-visual speech recognition
Zhang, Y
Levinson, S
Huang, T
2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000 : 1073 - 1076
[24] Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars
Ivanko, Denis
Ryumin, Dmitry
Axyonov, Alexandr
Kashevnik, Alexey
Karpov, Alexey
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022 : 1555 - 1559
[25] Multifactor fusion for audio-visual speaker recognition
Chetty, Girija
Tran, Dat
LECTURE NOTES IN SIGNAL SCIENCE, INTERNET AND EDUCATION (SSIP'07/MIV'07/DIWEB'07), 2007 : 70 - +
[26] ENVIRONMENTALLY ROBUST AUDIO-VISUAL SPEAKER IDENTIFICATION
Schoenherr, Lea
Orth, Dennis
Heckmann, Martin
Kolossa, Dorothea
2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016 : 312 - 318
[27] Audio-visual biometric based speaker identification
Kar, Biswajit
Bhatia, Sandeep
Dutta, P. K.
ICCIMA 2007: INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND MULTIMEDIA APPLICATIONS, VOL IV, PROCEEDINGS, 2007 : 94 - 98
[28] Audio-visual system for robust speaker recognition
Chen, Q
Yang, JG
Gou, J
MLMTA '05: Proceedings of the International Conference on Machine Learning Models Technologies and Applications, 2005 : 97 - 103
[29] Audio-Visual Feature Fusion for Speaker Identification
Almaadeed, Noor
Aggoun, Amar
Amira, Abbes
NEURAL INFORMATION PROCESSING, ICONIP 2012, PT I, 2012, 7663 : 56 - 67
[30] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Jiang, Hao
Murdock, Calvin
Ithapu, Vamsi Krishna
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022 : 10534 - 10542
