Multimodal speaker segmentation in presence of overlapped speech segments

Cited by: 2

Authors:

Rozgic, Viktor [1]

Han, Kyu Jeong [1]

Georgiou, Panayiotis G. [1]

Narayanan, Shrikanth [1]

Affiliation:

[1] University of Southern California, Department of Electrical Engineering, Speech Analysis and Interpretation Laboratory, Viterbi School of Engineering, Los Angeles, CA 90089 USA

Source:

ISM: 2008 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA | 2008

Keywords: (none listed)

DOI:

10.1109/ISM.2008.103

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

We propose a multimodal speaker segmentation algorithm with two main contributions. First, we suggest a hidden Markov model architecture that fuses three modalities: a multi-camera system for participant localization, a microphone array for speaker localization, and a speaker identification system. Second, we present a novel method for dealing with overlapped speech segments through a likelihood model of the microphone array observations that uses multiple local maxima of the Steered Power Response Generalized Cross Correlation Phase Transform (SPR-GCC-PHAT) function in the Joint Probabilistic Data Association (JPDA) framework. Results show that, on datasets with a significant portion (27.4%) of overlapped speech, the proposed method outperforms standard speaker segmentation systems based on (a) speaker identification and (b) microphone array processing, scoring as high as 94.4% on the F-measure scale.
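The abstract's idea of keeping several local maxima of a GCC-PHAT function, rather than only the global peak, can be illustrated with a minimal sketch for a single microphone pair. This is not the authors' implementation: the steered-response combination over the full array, the JPDA association step, and the HMM fusion are all omitted, and the function names (`gcc_phat`, `top_local_maxima`, `simulate_two_speakers`), the white-noise "speakers", and the sample delays are assumptions made purely for illustration.

```python
import numpy as np


def gcc_phat(sig, ref, max_shift):
    """GCC-PHAT of `sig` against `ref` for lags in [-max_shift, max_shift] samples.

    Assumes 1 <= max_shift < (len(sig) + len(ref)) // 2.
    """
    n = len(sig) + len(ref)  # zero-pad so the correlation is linear, not circular
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so the output runs from lag -max_shift up to lag +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1)
    return lags, cc


def top_local_maxima(cc, lags, k):
    """Return up to k (lag, value) pairs for the strongest interior local maxima."""
    interior = np.arange(1, len(cc) - 1)
    is_peak = (cc[1:-1] > cc[:-2]) & (cc[1:-1] >= cc[2:])
    peaks = interior[is_peak]
    order = np.argsort(cc[peaks])[::-1][:k]  # strongest peaks first
    return [(int(lags[p]), float(cc[p])) for p in peaks[order]]


def simulate_two_speakers(n_samples=4096, delays=(5, -8), seed=0):
    """Hypothetical test data: two overlapping white-noise sources with different delays."""
    rng = np.random.default_rng(seed)
    s1, s2 = rng.standard_normal((2, n_samples))
    ref = s1 + s2                                          # reference microphone
    sig = np.roll(s1, delays[0]) + np.roll(s2, delays[1])  # second microphone
    return sig, ref


sig, ref = simulate_two_speakers()
lags, cc = gcc_phat(sig, ref, max_shift=32)
candidates = top_local_maxima(cc, lags, k=2)  # one delay candidate per active source
```

With two simultaneously active sources, the correlation function has one peak per source, so taking only the global maximum would discard the second speaker. Retaining the top few local maxima gives a set of candidate observations that a data-association stage could then assign to participants.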

Pages: 679-684

Page count: 6
