Target Active Speaker Detection with Audio-visual Cues

被引:2
|
作者
Jiang, Yidi [1 ]
Tao, Ruijie [1 ]
Pan, Zexu [1 ]
Li, Haizhou [1 ,2 ]
机构
[1] Natl Univ Singapore, Singapore, Singapore
[2] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
来源
基金
中国国家自然科学基金;
关键词
Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;
D O I
10.21437/Interspeech.2023-574
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, it is possible that we can also have the reference speech of the on-screen speaker. To benefit from both facial cue and reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular model, TalkNet on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
引用
收藏
页码:3152 / 3156
页数:5
相关论文
共 50 条
  • [41] Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection
    Choudhury, T
    Rehg, JM
    Pavlovic, V
    Pentland, A
    16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL III, PROCEEDINGS, 2002, : 789 - 794
  • [42] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
    Fecher, Natalie
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
  • [43] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
    Zuern, Jannik
    Burgard, Wolfram
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
  • [44] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [45] Speaker and digit recognition by audio-visual lip biometrics
    Faraj, Maycel Isaac
    Bigun, Josef
    ADVANCES IN BIOMETRICS, PROCEEDINGS, 2007, 4642 : 1016 - +
  • [46] Dynamic Bayesian Networks for audio-visual speaker recognition
    Li, DD
    Yang, YC
    Wu, ZH
    ADVANCES IN BIOMETRICS, PROCEEDINGS, 2006, 3832 : 539 - 545
  • [47] Audio-visual speaker identification with asynchronous articulatory feature
    Chen, Yanxiang
    Liu, M.
    ELECTRONICS LETTERS, 2010, 46 (03) : 242 - U77
  • [48] AUDIO-VISUAL SPEAKER LOCALIZATION VIA WEIGHTED CLUSTERING
    Gebru, Israel D.
    Alameda-Pineda, Xavier
    Horaud, Radu
    Forbes, Florence
    2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
  • [49] Speaker independent audio-visual continuous speech recognition
    Liang, LH
    Liu, XX
    Zhao, YB
    Pi, XB
    Nefian, AV
    IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
  • [50] Audio-visual speaker localization using graphical models
    Kushal, Akash
    Rahurkar, Mandar
    Li Fei-Fei
    Ponce, Jean
    Huang, Thomas
    18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 291 - +