Target Active Speaker Detection with Audio-visual Cues

Cited by: 2

Authors:

Jiang, Yidi [1]

Tao, Ruijie [1]

Pan, Zexu [1]

Li, Haizhou [1,2]

Affiliations:

[1] Natl Univ Singapore, Singapore, Singapore

[2] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;

DOI:

10.21437/Interspeech.2023-574

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, the reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
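The abstract reports results in terms of EER (equal error rate) among other metrics. As a rough illustration of what that metric measures, the sketch below computes an approximate EER from per-frame speaking scores and binary labels. It is not taken from the TS-TalkNet code; the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy data are assumptions made for this example only.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where the false-accept rate
    (non-speaking frames scored as speaking) is closest to the
    false-reject rate (speaking frames scored as not speaking).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    best_gap, eer = None, None
    # Sweep every observed score as a candidate decision threshold.
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # false-accept rate
        frr = sum(s < t for s in pos) / len(pos)   # false-reject rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data (made up): 1 = target speaker is speaking, 0 = not speaking.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A lower EER is better, which is why the abstract lists the EER change alongside the AP and AUC gains as an improvement. Evaluation toolkits usually interpolate between thresholds rather than picking the closest one, so values from this sketch may differ slightly from reported numbers.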


Pages: 3152 - 3156

Number of pages: 5

50 items in total

[31] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Jiang, Hao
Murdock, Calvin
Ithapu, Vamsi Krishna
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542
[32] Audio-visual speaker identification based on the use of dynamic audio and visual features
Fox, N
Reilly, RB
AUDIO-BASED AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 743 - 751
[33] Audio-visual integration of emotional cues in song
Thompson, William Forde
Russo, Frank A.
Quinto, Lena
COGNITION & EMOTION, 2008, 22 (08) : 1457 - 1470
[34] Bootstrapping Audio-Visual Video Segmentation by Strengthening Audio Cues
Chen, Tianxiang
Tan, Zhentao
Gong, Tao
Chu, Qi
Wu, Yue
Liu, Bin
Yu, Nenghai
Lu, Le
Ye, Jieping
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (03) : 2398 - 2409
[35] Audio-visual Cues for Cloud Service Monitoring
Bermbach, David
Eberhardt, Jacob
CLOSER: PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2017, : 439 - 446
[36] A Visual Signal Reliability for Robust Audio-Visual Speaker Identification
Tariquzzaman, Md.
Kim, Jin Young
Na, Seung You
Kim, Hyoung-Gook
Har, Dongsoo
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2011, E94D (10) : 2052 - 2055
[37] Multimodal SpeakerBeam: Single channel target speech extraction with audio-visual speaker clues
Ochiai, Tsubasa
Delcroix, Marc
Kinoshita, Keisuke
Ogawa, Atsunori
Nakatani, Tomohiro
INTERSPEECH 2019, 2019, : 2718 - 2722
[38] BEST OF BOTH WORLDS: MULTI-TASK AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION AND ACTIVE SPEAKER DETECTION
Braga, Otavio
Siohan, Olivier
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6047 - 6051
[39] Particle Filtering for Bearing-Only Audio-Visual Speaker Detection and Tracking
Rae, Andrew
Khamis, Alaa
Basir, Otman
Kamel, Mohamed
2009 3RD INTERNATIONAL CONFERENCE ON SIGNALS, CIRCUITS AND SYSTEMS (SCS 2009), 2009, : 161 - +
[40] E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing
Yu, Xiaojing
Zhang, Lan
Li, Xiang-yang
2023 20TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING, SECON, 2023,
