Target Active Speaker Detection with Audio-visual Cues

被引：2

作者：

Jiang, Yidi ^{[1
]}

Tao, Ruijie ^{[1
]}

Pan, Zexu ^{[1
]}

Li, Haizhou ^{[1
,2
]}

机构：

[1] Natl Univ Singapore, Singapore, Singapore

[2] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China

来源：

INTERSPEECH 2023 | 2023年

基金：

中国国家自然科学基金;

关键词：

Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;

D O I：

10.21437/Interspeech.2023-574

中图分类号：

O42 [声学];

学科分类号：

070206 ; 082403 ;

摘要：

In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, it is possible that we can also have the reference speech of the on-screen speaker. To benefit from both facial cue and reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular model, TalkNet on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.

引用

页码：3152 / 3156

页数：5

共 50 条

[41] Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection
Choudhury, T
Rehg, JM
Pavlovic, V
Pentland, A
16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL III, PROCEEDINGS, 2002, : 789 - 794
[42] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
Fecher, Natalie
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
[43] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
Zuern, Jannik
Burgard, Wolfram
IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
[44] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
Shao, Xu
Barker, Jon
INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
[45] Speaker and digit recognition by audio-visual lip biometrics
Faraj, Maycel Isaac
Bigun, Josef
ADVANCES IN BIOMETRICS, PROCEEDINGS, 2007, 4642 : 1016 - +
[46] Dynamic Bayesian Networks for audio-visual speaker recognition
Li, DD
Yang, YC
Wu, ZH
ADVANCES IN BIOMETRICS, PROCEEDINGS, 2006, 3832 : 539 - 545
[47] Audio-visual speaker identification with asynchronous articulatory feature
Chen, Yanxiang
Liu, M.
ELECTRONICS LETTERS, 2010, 46 (03) : 242 - U77
[48] AUDIO-VISUAL SPEAKER LOCALIZATION VIA WEIGHTED CLUSTERING
Gebru, Israel D.
Alameda-Pineda, Xavier
Horaud, Radu
Forbes, Florence
2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
[49] Speaker independent audio-visual continuous speech recognition
Liang, LH
Liu, XX
Zhao, YB
Pi, XB
Nefian, AV
IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
[50] Audio-visual speaker localization using graphical models
Kushal, Akash
Rahurkar, Mandar
Li Fei-Fei
Ponce, Jean
Huang, Thomas
18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 291 - +

← 1 2 3 4 5 →