Target Active Speaker Detection with Audio-visual Cues

Cited by: 2
Authors
Jiang, Yidi [1]
Tao, Ruijie [1]
Pan, Zexu [1]
Li, Haizhou [1, 2]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;
DOI
10.21437/Interspeech.2023-574
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
In active speaker detection (ASD), the goal is to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC, and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
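One of the metrics named in the abstract is the equal error rate (EER). As an illustration only, the sketch below shows one common way to approximate EER from per-frame scores and binary labels by sweeping a decision threshold. It is not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for binary labels (1 = speaking, 0 = not speaking).

    Illustrative helper only; not taken from the TS-TalkNet repository.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    best_gap, eer = None, None
    # Use every observed score as a threshold: predict "speaking" if score >= t.
    for t in sorted(set(scores)):
        far = sum(s >= t for s in negatives) / len(negatives)  # false acceptance rate
        frr = sum(s < t for s in positives) / len(positives)   # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example with made-up scores (assumption, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
print(round(equal_error_rate(toy_scores, toy_labels), 3))  # 0.333
```

The threshold sweep is quadratic in the number of scores, which is fine for a small illustration; evaluation toolkits typically derive the same quantity from a sorted ROC curve.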
Pages: 3152-3156
Number of pages: 5