Multimodal speaker segmentation in presence of overlapped speech segments

Cited by: 2
Authors
Rozgic, Viktor [1]
Han, Kyu Jeong [1]
Georgiou, Panayiotis G. [1]
Narayanan, Shrikanth [1]
Affiliations
[1] Univ So Calif, Dept Elect Engn, Speech Anal & Interpretat Lab, Viterbi Sch Engn, Los Angeles, CA 90089 USA
DOI
10.1109/ISM.2008.103
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
We propose a multimodal speaker segmentation algorithm with two main contributions. First, we suggest a hidden Markov model architecture that fuses three modalities: a multi-camera system for participant localization, a microphone array for speaker localization, and a speaker identification system. Second, we present a novel method for handling overlapped speech segments through a likelihood model of the microphone array observations that uses multiple local maxima of the Steered Power Response Generalized Cross Correlation Phase Transform (SPR-GCC-PHAT) function in the Joint Probabilistic Data Association (JPDA) framework. Results show that the proposed method outperforms standard speaker segmentation systems based on (a) speaker identification and (b) microphone array processing on datasets with a significant portion (27.4%) of overlapped speech, and scores as high as 94.4% on the F-measure scale.
Pages: 679-684
Number of pages: 6
Related papers
50 in total
  • [41] Improved Speaker Diarization of Meeting Speech with Recurrent Selection of Representative Speech Segments and Participant Interaction Pattern Modeling
    Han, Kyu J.
    Narayanan, Shrikanth S.
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1051 - 1054
  • [42] Multimodal speaker/speech recognition using lip motion, lip texture and audio
    Cetingul, H. E.
    Erzin, E.
    Yemez, Y.
    Tekalp, A. M.
    SIGNAL PROCESSING, 2006, 86 (12) : 3549 - 3558
  • [43] RobinNet: A Multimodal Speech Emotion Recognition System With Speaker Recognition for Social Interactions
    Khurana, Yash
    Gupta, Swamita
    Sathyaraj, R.
    Raja, S. P.
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2022, 11 (01) : 478 - 487
  • [44] PARAMETRIC REPRESENTATION OF THE SPEAKER'S LIPS FOR MULTIMODAL SIGN LANGUAGE AND SPEECH RECOGNITION
    Ryumin, D.
    Karpov, A. A.
    INTERNATIONAL WORKSHOP PHOTOGRAMMETRIC AND COMPUTER VISION TECHNIQUES FOR VIDEO SURVEILLANCE, BIOMETRICS AND BIOMEDICINE, 2017, 42-2 (W4) : 155 - 161
  • [45] FRAME LEVEL ENTROPY BASED OVERLAPPED SPEECH DETECTION AS A PRE-PROCESSING STAGE FOR SPEAKER DIARIZATION
    Ben-Harush, Oshry
    Guterman, Hugo
    Lapidot, Itshak
    2009 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2009, : 321 - +
  • [46] SPEAKER-INDEPENDENT CLASSIFICATION OF PHONETIC SEGMENTS FROM RAW ULTRASOUND IN CHILD SPEECH
    Ribeiro, Manuel Sam
    Eshky, Aciel
    Richmond, Korin
    Renals, Steve
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 1328 - 1332
  • [47] Speech Signal Segmentation into Vocalized and Unvocalized Segments on the Basis of Simultaneous Masking
    Konev, A. A.
    Meshcheryakov, R. V.
    Kostyuchenko, E. Yu
    OPTOELECTRONICS INSTRUMENTATION AND DATA PROCESSING, 2018, 54 (04) : 361 - 366
  • [48] SPEAKER LOCALIZATION AND TRACKING IN THE PRESENCE OF SOUND INTERFERENCE BY EXPLOITING SPEECH HARMONICITY
    Wu, Kai
    Goh, Shu Ting
    Khong, Andy W. H.
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 365 - 369
  • [49] Multimodal Speaker Diarization
    Noulas, Athanasios
    Englebienne, Gwenn
    Krose, Ben J. A.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2012, 34 (01) : 79 - 93
  • [50] Speech Segmentation and Speaker Diarization using Time-Delay Neural Network
    Toruk, Mesut
    Serbes, Ahmet
    Bilgin, Gokhan
    2019 INNOVATIONS IN INTELLIGENT SYSTEMS AND APPLICATIONS CONFERENCE (ASYU), 2019, : 335 - 339