Target Active Speaker Detection with Audio-visual Cues

Cited by: 2
Authors
Jiang, Yidi [1]
Tao, Ruijie [1]
Pan, Zexu [1]
Li, Haizhou [1,2]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;
DOI
10.21437/Interspeech.2023-574
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, it is possible that we also have the reference speech of the on-screen speaker. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular model TalkNet on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
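The idea of letting a pre-enrolled speaker embedding complement a synchronization cue can be illustrated with a very small score-level fusion sketch. This is not the TS-TalkNet architecture, which this record does not describe; the function names (`cosine`, `fuse_scores`, `detect`), the fixed `weight`, and the `threshold` are all hypothetical choices made only for illustration, and the inputs are assumed to be a per-frame synchronization score in [0, 1] and a per-frame voice embedding.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_scores(sync_scores, frame_voice_embs, enrolled_emb, weight=0.5):
    """Blend a per-frame sync score with similarity to the enrolled speaker.

    Hypothetical late fusion, not the paper's method:
    fused = (1 - weight) * sync + weight * similarity,
    where the cosine similarity is rescaled from [-1, 1] to [0, 1].
    """
    if len(sync_scores) != len(frame_voice_embs):
        raise ValueError("need one voice embedding per sync score")
    fused = []
    for sync, emb in zip(sync_scores, frame_voice_embs):
        similarity = (cosine(emb, enrolled_emb) + 1.0) / 2.0
        fused.append((1.0 - weight) * sync + weight * similarity)
    return fused


def detect(fused_scores, threshold=0.5):
    """Mark a frame as 'target speaker is speaking' when its fused score reaches the threshold."""
    return [score >= threshold for score in fused_scores]


# Two frames with the same sync score; only the first one sounds like the enrolled speaker.
enrolled = [1.0, 0.0]
frames = [[1.0, 0.0], [-1.0, 0.0]]
decisions = detect(fuse_scores([0.6, 0.6], frames, enrolled))
```

In this toy case the two frames are indistinguishable by the synchronization score alone, and the speaker similarity is what separates them. A learned model would typically fuse the cues at the feature level rather than with a fixed weight.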
Pages: 3152-3156
Page count: 5