Active Speaker Detection with Audio-Visual Co-training

Cited by: 12
Authors
Chakravarty, Punarjay [1]
Zegers, Jeroen [2]
Tuytelaars, Tinne [1]
Van Hamme, Hugo [2]
Affiliations
[1] Katholieke Univ Leuven, iMinds MMT, ESAT PSI, Leuven, Belgium
[2] Katholieke Univ Leuven, ESAT PSI, Leuven, Belgium
Keywords
Active Speaker Detection; Audio-visual Co-training; RECOGNITION;
DOI
10.1145/2993148.2993172
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision: audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
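The co-training loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the synthetic features, the energy-threshold VAD, the midpoint-threshold video classifier, the nearest-centroid voice models, and every name below (`simulate`, `VOICE_MEANS`, `video_thresholds`, `voice_models`, `detect`) are assumptions chosen only to make the three stages concrete. It also assumes each person has a solo segment available for the first stage.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D "voice features" for two people (assumption for illustration).
VOICE_MEANS = {0: np.array([-2.0, 0.0]), 1: np.array([2.0, 0.0])}

def simulate(n_frames, people):
    """Return the true speaker per frame (-1 = silence) plus toy audio and video features."""
    speaker = rng.choice([-1] + list(people), size=n_frames)
    # Audio energy is high when anyone speaks.
    energy = np.where(speaker >= 0, 1.0, 0.0) + rng.normal(0, 0.2, n_frames)
    # Voice features cluster around the active speaker's mean (zeros during silence).
    voice = np.stack([VOICE_MEANS[int(s)] if s >= 0 else np.zeros(2) for s in speaker])
    voice = voice + rng.normal(0, 0.5, voice.shape)
    # Mouth motion of person p is high only while p speaks.
    motion = {p: (speaker == p).astype(float) + rng.normal(0, 0.2, n_frames) for p in people}
    return speaker, energy, voice, motion

# Stage 1: audio VAD weakly supervises a personalized video classifier.
# Assumption: each person has a solo segment, so VAD labels refer to that person.
video_thresholds = {}
for p in (0, 1):
    _, energy, _, motion = simulate(300, [p])
    vad = energy > 0.5  # toy energy-based VAD, no manual labels involved
    # 1-D classifier: threshold halfway between the two weakly labeled class means.
    video_thresholds[p] = 0.5 * (motion[p][vad].mean() + motion[p][~vad].mean())

# Stage 2: the video classifiers supervise a voice model for each person.
_, _, voice, motion = simulate(600, [0, 1])
voice_models = {}
for p in (0, 1):
    other = 1 - p
    speaking = (motion[p] > video_thresholds[p]) & (motion[other] <= video_thresholds[other])
    voice_models[p] = voice[speaking].mean(axis=0)  # centroid as a minimal voice model

# Stage 3: the voice models detect the active speaker from audio alone.
def detect(energy, voice):
    """Return the predicted speaker per frame, or -1 where the VAD reports silence."""
    centroids = np.stack([voice_models[0], voice_models[1]])
    dists = np.linalg.norm(voice[:, None, :] - centroids[None, :, :], axis=2)
    return np.where(energy > 0.5, dists.argmin(axis=1), -1)

test_speaker, test_energy, test_voice, _ = simulate(600, [0, 1])
predicted = detect(test_energy, test_voice)
accuracy = float((predicted == test_speaker).mean())
print(f"audio-only active speaker accuracy: {accuracy:.2f}")
```

The ground-truth `speaker` array is used only to generate the synthetic data and to score the final predictions; none of the three training stages reads it, which mirrors the abstract's claim that no manual supervision is involved.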
Pages: 312-316
Number of pages: 5