Speaker-turn aware diarization for speech-based cognitive assessments

Cited by: 0
Authors
Xu, Sean Shensheng [1]
Ke, Xiaoquan [2]
Mak, Man-Wai [2]
Wong, Ka Ho [3]
Meng, Helen [4]
Kwok, Timothy C. Y. [5]
Gu, Jason [6]
Zhang, Jian [7]
Tao, Wei [8]
Chang, Chunqi [1]
Affiliations
[1] Shenzhen Univ, Med Sch, Sch Biomed Engn, Shenzhen, Peoples R China
[2] Hong Kong Polytech Univ, Dept Elect & Informat Engn, Kowloon, Hong Kong, Peoples R China
[3] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China
[4] Chinese Univ Hong Kong, Dept Med & Therapeut, Shatin, Hong Kong, Peoples R China
[5] Chinese Univ Hong Kong, Jockey Club Ctr Osteoporosis Care & Control, Shatin, Hong Kong, Peoples R China
[6] Dalhousie Univ, Dept Elect & Comp Engn, Halifax, NS, Canada
[7] Shenzhen Univ Med Sch, Med Sch, Sch Pharm, Shenzhen, Peoples R China
[8] Shenzhen Univ, Dept Neurosurg, South China Hosp, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
speaker diarization; speaker embedding; comprehensive scoring; speaker-turn timestamps; MoCA; dementia detection; mental-state examination; impairment; dementia
DOI
10.3389/fnins.2023.1351848
Chinese Library Classification (CLC) number
Q189 [Neuroscience]
Discipline classification code
071006
Abstract
Introduction: Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessments (MoCA).
Methods: This paper proposes three enhancements to the conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, multi-scale channel interdependence speaker embedding is used as the front-end speaker representation for overcoming the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Third, to further enhance the diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn aware scoring matrix contains both local and global information across the segments.
Results: Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios.
Discussion: The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.
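As a rough illustration of the "scoring matrix followed by clustering" pipeline the abstract refers to, the sketch below scores every pair of segment embeddings and groups the segments into speakers. It is not the authors' method: plain cosine similarity stands in for the paper's sequence-comparison and PLDA scoring, the average-linkage clustering is a generic choice, and the names `score_matrix`, `fuse`, `cluster`, the weight `alpha`, and the toy 2-D embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_matrix(embeddings):
    """Pairwise (local) similarity between all segment embeddings."""
    n = len(embeddings)
    return [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def fuse(local, global_, alpha=0.5):
    """Weighted sum of two score matrices (alpha is an assumed weight,
    not a value taken from the paper)."""
    return [[alpha * l + (1 - alpha) * g for l, g in zip(l_row, g_row)]
            for l_row, g_row in zip(local, global_)]

def cluster(scores, num_speakers=2):
    """Average-linkage agglomerative clustering on a similarity matrix."""
    clusters = [[i] for i in range(len(scores))]
    while len(clusters) > num_speakers:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(scores[i][j]
                            for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(scores)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

# Toy embeddings: segments 0, 1, 4 resemble one speaker; 2, 3 the other.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cluster(score_matrix(segments))
print(labels)  # [0, 0, 1, 1, 0]
```

In this sketch a second, conversation-level score matrix could be combined with the pairwise one through `fuse` before calling `cluster`; how the paper actually combines local and global information is not specified in the abstract.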
Pages: 14