Active Speaker Detection with Audio-Visual Co-training

被引:12
|
作者
Chakravarty, Punarjay [1 ]
Zegers, Jeroen [2 ]
Tuytelaars, Tinne [1 ]
Van Hamme, Hugo [2 ]
机构
[1] Katholieke Univ Leuven, iMinds MMT, ESAT PSI, Leuven, Belgium
[2] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium
关键词
Active Speaker Detection; Audio-visual Co-training; RECOGNITION;
D O I
10.1145/2993148.2993172
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
引用
收藏
页码:312 / 316
页数:5
相关论文
共 50 条
  • [21] A Bayesian approach to audio-visual speaker identification
    Nefian, AV
    Liang, LH
    Fu, TY
    Liu, XX
    AUDIO-BASED AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 761 - 769
  • [22] Deep Audio-Visual Beamforming for Speaker Localization
    Qian, Xinyuan
    Zhang, Qiquan
    Guan, Guohui
    Xue, Wei
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1132 - 1136
  • [23] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [24] Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars
    Ivanko, Denis
    Ryumin, Dmitry
    Axyonov, Alexandr
    Kashevnik, Alexey
    Karpov, Alexey
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1555 - 1559
  • [25] Multifactor fusion for audio-visual speaker recognition
    Chetty, Girija
    Tran, Dat
    LECTURE NOTES IN SIGNAL SCIENCE, INTERNET AND EDUCATION (SSIP'07/MIV'07/DIWEB'07), 2007, : 70 - +
  • [26] ENVIRONMENTALLY ROBUST AUDIO-VISUAL SPEAKER IDENTIFICATION
    Schoenherr, Lea
    Orth, Dennis
    Heckmann, Martin
    Kolossa, Dorothea
    2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 312 - 318
  • [27] Audio-visual biometric based speaker identification
    Kar, Biswajit
    Bhatia, Sandeep
    Dutta, P. K.
    ICCIMA 2007: INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND MULTIMEDIA APPLICATIONS, VOL IV, PROCEEDINGS, 2007, : 94 - 98
  • [28] Audio-visual system for robust speaker recognition
    Chen, Q
    Yang, JG
    Gou, J
    MLMTA '05: Proceedings of the International Conference on Machine Learning Models Technologies and Applications, 2005, : 97 - 103
  • [29] Audio-Visual Feature Fusion for Speaker Identification
    Almaadeed, Noor
    Aggoun, Amar
    Amira, Abbes
    NEURAL INFORMATION PROCESSING, ICONIP 2012, PT I, 2012, 7663 : 56 - 67
  • [30] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
    Jiang, Hao
    Murdock, Calvin
    Ithapu, Vamsi Krishna
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542