ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings

Cited by: 0
Authors
Mariotte, Theo [1, 2]
Larcher, Anthony [2 ]
Montresori, Silvio [1 ]
Thomas, Jean-Hugh [1 ]
Affiliations
[1] Le Mans Univ, Inst Claude Chappe, LIUM, Le Mans, France
[2] Le Mans Univ, LAUM IA GS UMR CNRS 6613, Le Mans, France
Source
Keywords
speaker diarization; distant speech; multimicrophone; explainable AI
DOI
10.21437/Interspeech.2024-917
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speaker diarization (SD) aims to group speech segments that belong to the same speaker. The task is required in many speech-processing applications, such as rich meeting transcription, where distant microphone arrays usually capture the audio signal. Beamforming, i.e., spatial filtering, is common practice for processing multi-microphone audio data. However, it often requires explicit localization of the active source to steer the filter. This paper proposes a self-attention-based algorithm that selects the output of a bank of fixed spatial filters. The method serves as a feature extractor for joint voice activity detection (VAD) and overlapped speech detection (OSD). The speaker diarization is then inferred from the detected segments. The approach shows convincing distant VAD, OSD, and SD performance, e.g., a 14.5% diarization error rate (DER) on the AISHELL-4 dataset. An analysis of the self-attention weights demonstrates their explainability, as they correlate with the speakers' angular locations.
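The idea of letting self-attention weight the outputs of a bank of fixed beamformers can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `attentive_beam_selection`, the projection matrices `w_q`, `w_k`, `w_v`, the tensor shapes, and the choice to average the attention matrix over queries are all assumptions made for illustration. The sketch also assumes that per-beam feature vectors have already been extracted upstream.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attentive_beam_selection(beam_feats, w_q, w_k, w_v):
    """Weight the outputs of a bank of fixed beamformers with self-attention.

    beam_feats: array of shape (T, B, F), i.e. T frames, B beams, F features
                per beam (hypothetical layout, assumed for this sketch).
    w_q, w_k, w_v: projection matrices of shape (F, D).
    Returns the combined features (T, D) and the per-beam weights (T, B).
    """
    q = beam_feats @ w_q  # (T, B, D) queries, one per beam
    k = beam_feats @ w_k  # (T, B, D) keys
    v = beam_feats @ w_v  # (T, B, D) values
    d = q.shape[-1]

    # Scaled dot-product attention across the beam axis, per frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (T, B, B)
    attn = softmax(scores, axis=-1)                 # each row sums to 1

    # Assumption: average over queries to get one weight per beam and frame.
    beam_weights = attn.mean(axis=1)                # (T, B), sums to 1 over B

    # Weighted sum of the beam values gives one feature vector per frame.
    combined = np.einsum("tb,tbd->td", beam_weights, v)
    return combined, beam_weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, B, F, D = 4, 6, 8, 5  # toy sizes, chosen arbitrarily
    feats = rng.standard_normal((T, B, F))
    w_q, w_k, w_v = (rng.standard_normal((F, D)) for _ in range(3))
    combined, weights = attentive_beam_selection(feats, w_q, w_k, w_v)
    print(combined.shape, weights.shape)
```

In a setup like this, the per-beam weights are the quantity one would inspect for explainability, since each beam of a fixed filter bank corresponds to a steering direction.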
Pages: 1620-1624
Page count: 5
Related papers
50 in total
  • [31] Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings
    Dawalatabad, Nauman
    Madikeri, Srikanth
    Sekhar, C. Chandra
    Murthy, Hema A.
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 14 - 27
  • [32] Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features
    Vijayasenan, Deepu
    Valente, Fabio
    Bourlard, Herve
    SPEECH COMMUNICATION, 2012, 54 (01) : 55 - 67
  • [33] Investigating the Effect of Varying Window Sizes in Speaker Diarization for Meetings Domain
    Naik, Nirali
    Mankad, Sapan H.
    Thakkar, Priyank
    INFORMATION AND COMMUNICATION TECHNOLOGY FOR INTELLIGENT SYSTEMS (ICTIS 2017) - VOL 2, 2018, 84 : 361 - 369
  • [34] Information Bottleneck Features for HMM/GMM Speaker Diarization of Meetings Recordings
    Yella, Sree Harsha
    Valente, Fabio
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 960 - 963
  • [35] Estimating Dominance in Multi-Party Meetings Using Speaker Diarization
    Hung, Hayley
    Huang, Yan
    Friedland, Gerald
    Gatica-Perez, Daniel
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2011, 19 (04) : 847 - 860
  • [36] Robust speaker diarization for meetings: ICSI RT06S meetings evaluation system
    Anguera, Xavier
    Wooters, Chuck
    Pardo, Jose M.
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, 2006, 4299 : 346 - +
  • [37] The IBM RT07 evaluation systems for speaker diarization on lecture meetings
    Huang, Jing
    Marcheret, Etienne
    Visweswariah, Karthik
    Potamianos, Gerasimos
    MULTIMODAL TECHNOLOGIES FOR PERCEPTION OF HUMANS, 2008, 4625 : 497 - 508
  • [38] PURE SEGMENT SELECTION AS SPEAKER DIARIZATION POST-PROCESSING
    Ben-Harush, Oshry
    Guterman, Hugo
    Lapidot, Itshak
    2008 IEEE 25TH CONVENTION OF ELECTRICAL AND ELECTRONICS ENGINEERS IN ISRAEL, VOLS 1 AND 2, 2008, : 461 - +
  • [39] MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation
    Li, Xiyun
    Xu, Yong
    Yu, Meng
    Zhang, Shi-Xiong
    Xu, Jiaming
    Xu, Bo
    Yu, Dong
    INTERSPEECH 2021, 2021, : 1119 - 1123
  • [40] Robust speaker segmentation for meetings: The ICSI-SRI Spring 2005 Diarization System
    Anguera, X
    Wooters, C
    Peskin, B
    Aguiló, M
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, 2005, 3869 : 402 - 414