Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Cited: 3
Authors
Tao, Ruijie [1 ]
Lee, Kong Aik [2 ,3 ]
Das, Rohan Kumar [4 ]
Hautamaki, Ville [5 ,6 ]
Li, Haizhou [7 ,8 ]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[2] Singapore Inst Technol, Singapore 138683, Singapore
[3] ASTAR, Inst Infocomm Res, Singapore 138632, Singapore
[4] Fortemedia, Singapore 138589, Singapore
[5] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[6] Univ Eastern Finland, Sch Comp, Joensuu 80101, Finland
[7] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[8] Kriston AI, Xiamen 361026, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Training; Self-supervised learning; Supervised learning; Face recognition; Speaker recognition; Task analysis; Neural networks; speaker recognition; diverse positive pairs; multi-modal; progressive clustering; INFORMATION; EXTRACTION;
DOI
10.1109/TASLP.2023.3268568
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique; however, contrastive learning of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EER) of 2.89%, 3.17%, and 6.27% under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77%, and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
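
As an illustration of the pair-sampling issue the abstract describes, below is a minimal PyTorch sketch of contrastive training with poor-man's positive pairs (PPP): two segments cut from the same utterance form a positive pair, while the other utterances in a batch act as negatives. The function names (sample_ppp, contrastive_loss), the NT-Xent-style loss, and all shapes and hyperparameters are illustrative assumptions, not the paper's implementation; under the proposed DPP strategy, the second view would instead come from a different utterance matched to the same speaker by cross-referencing the face modality.

import torch
import torch.nn.functional as F

def sample_ppp(utterance, seg_len):
    # Poor-man's positive pair (PPP): two random segments of one utterance.
    # 'utterance' is a waveform or feature tensor with time as the last axis.
    T = utterance.shape[-1]
    s0, s1 = torch.randint(0, T - seg_len + 1, (2,)).tolist()
    return utterance[..., s0:s0 + seg_len], utterance[..., s1:s1 + seg_len]

def contrastive_loss(z1, z2, tau=0.07):
    # NT-Xent-style objective: (z1[i], z2[i]) is the positive pair for item i;
    # every other embedding in the batch serves as a negative.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau          # (B, B) cosine-similarity matrix
    labels = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with random stand-ins for encoder outputs of two PPP segments.
B, D = 32, 192                          # assumed batch size and embedding dim
z1, z2 = torch.randn(B, D), torch.randn(B, D)
print(contrastive_loss(z1, z2).item())

Intuitively, both PPP segments share the channel and session conditions of a single recording, so the encoder can latch onto channel cues rather than speaker identity; the added diversity of DPP is aimed at removing exactly that shortcut.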
Pages: 1706-1719
Page count: 14