Target Active Speaker Detection with Audio-visual Cues

Cited by: 2

Authors:

Jiang, Yidi ^{[1]}

Tao, Ruijie ^{[1]}

Pan, Zexu ^{[1]}

Li, Haizhou ^{[1,2]}

Affiliations:

[1] National University of Singapore, Singapore, Singapore

[2] Chinese University of Hong Kong, School of Data Science, Shenzhen Research Institute of Big Data, Shenzhen, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;

DOI:

10.21437/Interspeech.2023-574

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification code:

070206 ; 082403 ;

Abstract:

In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, the reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular model TalkNet on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC, and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.


Pages: 3152-3156

Page count: 5
