Speaker-turn aware diarization for speech-based cognitive assessments

Cited by: 0
Authors
Xu, Sean Shensheng [1]
Ke, Xiaoquan [2]
Mak, Man-Wai [2]
Wong, Ka Ho [3]
Meng, Helen [4]
Kwok, Timothy C. Y. [5]
Gu, Jason [6]
Zhang, Jian [7]
Tao, Wei [8]
Chang, Chunqi [1]
Affiliations
[1] Shenzhen Univ, Med Sch, Sch Biomed Engn, Shenzhen, Peoples R China
[2] Hong Kong Polytech Univ, Dept Elect & Informat Engn, Kowloon, Hong Kong, Peoples R China
[3] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China
[4] Chinese Univ Hong Kong, Dept Med & Therapeut, Shatin, Hong Kong, Peoples R China
[5] Chinese Univ Hong Kong, Jockey Club Ctr Osteoporosis Care & Control, Shatin, Hong Kong, Peoples R China
[6] Dalhousie Univ, Dept Elect & Comp Engn, Halifax, NS, Canada
[7] Shenzhen Univ Med Sch, Med Sch, Sch Pharm, Shenzhen, Peoples R China
[8] Shenzhen Univ, Dept Neurosurg, South China Hosp, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
speaker diarization; speaker embedding; comprehensive scoring; speaker-turn timestamps; MOCA; dementia detection; MENTAL-STATE-EXAMINATION; IMPAIRMENT; DEMENTIA
DOI
10.3389/fnins.2023.1351848
Chinese Library Classification (CLC) number
Q189 [Neuroscience]
Discipline classification code
071006
Abstract
Introduction: Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessments (MoCA).
Methods: This paper proposes three enhancements to the conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, multi-scale channel interdependence speaker embedding is used as the front-end speaker representation to overcome the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Third, to further enhance the diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn aware scoring matrix contains both local and global information across the segments.
Results: Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios.
Discussion: The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.
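As a rough illustration of two building blocks named in the abstract, the sketch below shows a generic squeeze-and-excitation (SE) gate over a channels-by-frames feature map and a plain pairwise cosine scoring matrix over segment embeddings. It is not the authors' implementation: the function names, the weight shapes, and the use of cosine similarity as the pairwise measure are all assumptions made for illustration, and the paper's Res2Net blocks, channel-dependent attention, and sequence comparison scoring are not reproduced here.

```python
import numpy as np


def squeeze_excite(features, w1, b1, w2, b2):
    """Generic SE gate (hypothetical helper, not the paper's code).

    features: array of shape (channels, frames).
    w1, b1:   bottleneck layer, shapes (channels // r, channels) and (channels // r,).
    w2, b2:   expansion layer, shapes (channels, channels // r) and (channels,).
    """
    # Squeeze: average each channel over time to get one descriptor per channel.
    squeezed = features.mean(axis=1)
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = np.maximum(0.0, w1 @ squeezed + b1)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))
    # Scale: reweight every frame of a channel by that channel's gate.
    return features * gates[:, None]


def cosine_score_matrix(embeddings):
    """Pairwise cosine similarity between segment embeddings (assumed measure).

    embeddings: array of shape (segments, dim).
    Returns a symmetric (segments, segments) matrix that a clustering
    step could consume.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T
```

In a diarization pipeline of this general kind, a matrix such as the one returned by `cosine_score_matrix` would typically be passed to a clustering algorithm to group segments by speaker; how the paper combines local and global scores is described only in the full text.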
Pages: 14
Related papers
50 in total
  • [41] Exploring the Potential of Speech-based Virtual Assistants in Mixed Reality Applications for People with Cognitive Disabilities
    Vona, Francesco
    Torelli, Emanuele
    Beccaluva, Eleonora
    Garzotto, Franca
    PROCEEDINGS OF THE WORKING CONFERENCE ON ADVANCED VISUAL INTERFACES AVI 2020, 2020,
  • [43] Comparing accuracy in voice-based assessments of biological speaker traits across speech types
    Sorokowski, Piotr
    Groyecka-Bernard, Agata
    Frackowiak, Tomasz
    Kobylarek, Aleksander
    Kupczyk, Piotr
    Sorokowska, Agnieszka
    Misiak, Michal
    Oleszkiewicz, Anna
    Bugaj, Katarzyna
    Wlodarczyk, Malgorzata
    Pisanski, Katarzyna
    SCIENTIFIC REPORTS, 2023, 13 (01)
  • [44] Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots
    Gu, Jia-Chen
    Li, Tianda
    Liu, Quan
    Ling, Zhen-Hua
    Su, Zhiming
    Wei, Si
    Zhu, Xiaodan
    CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, 2020, : 2041 - 2044
  • [45] A Speech-Based Mobile Screening Tool for Mild Cognitive Impairment: Technical Performance and User Engagement Evaluation
    Ruzi, Rukiye
    Pan, Yue
    Ng, Menwa Lawrence
    Su, Rongfeng
    Wang, Lan
    Dang, Jianwu
    Liu, Liwei
    Yan, Nan
    BIOENGINEERING-BASEL, 2025, 12 (02):
  • [46] SPEAKER-AWARE TRAINING OF ATTENTION-BASED END-TO-END SPEECH RECOGNITION USING NEURAL SPEAKER EMBEDDINGS
    Rouhe, Aku
    Kaseva, Tuomas
    Kurimo, Mikko
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7064 - 7068
  • [47] AN ONLINE SPEAKER-AWARE SPEECH SEPARATION APPROACH BASED ON TIME-DOMAIN REPRESENTATION
    Wang, Hui
    Song, Yan
    Li, Zeng-Xi
    McLoughlin, Ian
    Dai, Li-Rong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6379 - 6383
  • [48] Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition
    Gu, Yue
    Du, Zhihao
    Zhang, Shiliang
    Chen, Qian
    Han, Jiqing
    INTERSPEECH 2023, 2023, : 1249 - 1253
  • [49] IMPROVING SPEECH-BASED END-OF-TURN DETECTION VIA CROSS-MODAL REPRESENTATION LEARNING WITH PUNCTUATED TEXT DATA
    Masumura, Ryo
    Ihori, Mana
    Tanaka, Tomohiro
    Ando, Atsushi
    Ishii, Ryo
    Oba, Takanobu
    Higashinaka, Ryuichiro
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 1062 - 1069
  • [50] Semi-supervised Multichannel Speech Separation Based on a Phone- and Speaker-Aware Deep Generative Model of Speech Spectrograms
    Du, Yicheng
    Sekiguchi, Kouhei
    Bando, Yoshiaki
    Nugraha, Aditya Arie
    Fontaine, Mathieu
    Yoshii, Kazuyoshi
    Kawahara, Tatsuya
    28TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2020), 2021, : 870 - 874