Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Cited by: 3
Authors
Tao, Ruijie [1 ]
Lee, Kong Aik [2 ,3 ]
Das, Rohan Kumar [4 ]
Hautamaki, Ville [5 ,6 ]
Li, Haizhou [7 ,8 ]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[2] Singapore Inst Technol, Singapore 138683, Singapore
[3] ASTAR, Inst Infocomm Res, Singapore 138632, Singapore
[4] Fortemedia, Singapore 138589, Singapore
[5] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[6] Univ Eastern Finland, Sch Comp, Joensuu 80101, Finland
[7] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[8] Kriston AI, Xiamen 361026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Self-supervised learning; Supervised learning; Face recognition; Speaker recognition; Task analysis; Neural networks; speaker recognition; diverse positive pairs; multi-modal; progressive clustering; INFORMATION; EXTRACTION;
DOI
10.1109/TASLP.2023.3268568
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique; however, contrastive learning of a speaker encoder depends heavily on the strategy for sampling positive and negative pairs. A common practice is to sample the two segments of a positive pair from the same utterance. Unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EERs) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This solution outperforms state-of-the-art self-supervised learning methods by a large margin while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
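To make the abstract's central contrast concrete, here is a minimal PyTorch sketch of the two positive-pair sampling strategies and a standard contrastive loss. It assumes pre-extracted frame-level features; the names sample_ppp, sample_dpp, contrastive_loss and the cluster_ids interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: PPP vs. DPP positive-pair sampling for contrastive
# speaker-encoder training. Hypothetical names; not the authors' code.

import torch
import torch.nn.functional as F


def sample_ppp(utterance, seg_len):
    """Poor-man's positive pair (PPP): two random crops of ONE utterance.

    utterance: (num_frames, feat_dim) acoustic features, num_frames >= seg_len.
    Both segments inevitably share the same channel/session conditions,
    which is the lack of diversity the paper criticises.
    """
    max_start = utterance.shape[0] - seg_len
    i, j = torch.randint(0, max_start + 1, (2,)).tolist()
    return utterance[i:i + seg_len], utterance[j:j + seg_len]


def sample_dpp(anchor_idx, cluster_ids, utterances, seg_len):
    """Diverse positive pair (DPP): pair the anchor with a DIFFERENT
    utterance in the same cluster, where clusters approximate speaker
    identity (in the paper, found by cross-referencing speech and face).
    """
    same = [k for k, c in enumerate(cluster_ids)
            if c == cluster_ids[anchor_idx] and k != anchor_idx]
    if not same:  # singleton cluster: fall back to PPP
        return sample_ppp(utterances[anchor_idx], seg_len)
    partner = utterances[same[torch.randint(len(same), (1,)).item()]]
    anchor_seg, _ = sample_ppp(utterances[anchor_idx], seg_len)
    start = torch.randint(0, partner.shape[0] - seg_len + 1, (1,)).item()
    return anchor_seg, partner[start:start + seg_len]


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """NT-Xent-style loss over a batch of positive pairs; every other
    pairing in the batch serves as a negative."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B) cosine similarities
    labels = torch.arange(a.shape[0], device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```

In the paper's two-stage variant, cluster assignments such as cluster_ids above would additionally serve as pseudo speaker labels for a subsequent supervised-style training stage, while the progressive clustering strategy refines those clusters as the encoder improves.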
Pages: 1706-1719
Number of pages: 14
Related Papers
50 records in total
  • [31] Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking
    Wang, Rui
    Zhu, Jiawei
    Wang, Shoujin
    Wang, Tao
    Huang, Jingze
    Zhu, Xianxun
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (04)
  • [32] Self-supervised Speaker Diarization
    Dissen, Yehoshua
    Kreuk, Felix
    Keshet, Joseph
    INTERSPEECH 2022, 2022, : 4013 - 4017
  • [33] Self-supervised speaker embeddings
    Stafylakis, Themos
    Rohdin, Johan
    Plchot, Oldrich
    Mizera, Petr
    Burget, Lukas
    INTERSPEECH 2019, 2019, : 2863 - 2867
  • [34] Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning
    Kang, Jingu
    Huh, Jaesung
    Heo, Hee Soo
    Chung, Joon Son
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1253 - 1262
  • [35] Heterogeneous self-supervised interest point matching for multi-modal remote sensing image registration
    Zhao, Ming
    Zhang, Guixiang
    Ding, Min
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 2022, 43 (03) : 915 - 931
  • [36] AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS
    Chern, I-Chun
    Hung, Kuo-Hsuan
    Chen, Yi-Ting
    Hussain, Tassadaq
    Gogate, Mandar
    Hussain, Amir
    Tsao, Yu
    Hou, Jen-Cheng
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [37] Towards Multi-modal Self-supervised Video and Ultrasound Pose Estimation for Laparoscopic Liver Surgery
    Montana-Brown, Nina
    Ramalhinho, Joao
    Koo, Bongjin
    Allam, Moustafa
    Davidson, Brian
    Gurusamy, Kurinchi
    Hu, Yipeng
    Clarkson, Matthew J.
    SIMPLIFYING MEDICAL ULTRASOUND, ASMUS 2022, 2022, 13565 : 183 - 192
  • [38] Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning
    Ye, Yiwen
    Xie, Yutong
    Zhang, Jianpeng
    Chen, Ziyang
    Wu, Qi
    Xia, Yong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 11114 - 11124
  • [39] Self-Supervised Feature Learning via Exploiting Multi-Modal Data for Retinal Disease Diagnosis
    Li, Xiaomeng
    Jia, Mengyu
    Islam, Md Tauhidul
    Yu, Lequan
    Xing, Lei
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2020, 39 (12) : 4023 - 4033
  • [40] Self-supervised contrastive speaker verification with nearest neighbor positive instances
    Liu, Yan
    Wei, Li-Fang
    Zhang, Chuan-Fei
    Zhang, Tian-Hao
    Chen, Song-Lu
    Yin, Xu-Cheng
    PATTERN RECOGNITION LETTERS, 2023, 173 : 17 - 22