Target Active Speaker Detection with Audio-visual Cues

Cited by: 2

Authors:

Jiang, Yidi [1]

Tao, Ruijie [1]

Pan, Zexu [1]

Li, Haizhou [1,2]

Affiliations:

[1] Natl Univ Singapore, Singapore, Singapore

[2] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;

DOI:

10.21437/Interspeech.2023-574

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In active speaker detection (ASD), the goal is to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving an absolute improvement of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC, and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.

Pages: 3152-3156

Page count: 5
