Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Cited by: 3
Authors
Tao, Ruijie [1 ]
Lee, Kong Aik [2 ,3 ]
Das, Rohan Kumar [4 ]
Hautamaki, Ville [5 ,6 ]
Li, Haizhou [7 ,8 ]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[2] Singapore Inst Technol, Singapore 138683, Singapore
[3] A*STAR, Inst Infocomm Res, Singapore 138632, Singapore
[4] Fortemedia, Singapore 138589, Singapore
[5] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[6] Univ Eastern Finland, Sch Comp, Joensuu 80101, Finland
[7] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[8] Kriston AI, Xiamen 361026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Self-supervised learning; Supervised learning; Face recognition; Speaker recognition; Task analysis; Neural networks; speaker recognition; diverse positive pairs; multi-modal; progressive clustering; INFORMATION; EXTRACTION;
DOI
10.1109/TASLP.2023.3268568
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique, but contrastive learning of the speaker encoder depends heavily on the strategy used to sample positive and negative pairs. A common practice is to sample a positive pair of segments from the same utterance; unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EER) of 2.89%, 3.17% and 6.27% on the three test sets of VoxCeleb1 under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels. This solution outperforms state-of-the-art self-supervised learning methods by a large margin, while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
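To make the sampling contrast described in the abstract concrete, below is a minimal PyTorch sketch. It is not the authors' released code: the names sample_ppp, sample_dpp, and face_to_utterances are hypothetical illustrations, and the loss is a generic NT-Xent contrastive objective standing in for whatever exact objective the paper uses.

```python
# Minimal sketch of PPP vs. DPP positive-pair sampling for contrastive
# speaker-encoder training. Assumes utterances are 1-D waveform tensors
# longer than seg_len; names are illustrative, not the authors' API.
import random
import torch
import torch.nn.functional as F

def sample_ppp(utterance: torch.Tensor, seg_len: int):
    """Poor-man's positive pair (PPP): two random segments cut from the
    SAME utterance, so the pair has limited session/channel diversity."""
    max_start = utterance.shape[-1] - seg_len
    s1, s2 = random.randint(0, max_start), random.randint(0, max_start)
    return utterance[..., s1:s1 + seg_len], utterance[..., s2:s2 + seg_len]

def sample_dpp(face_to_utterances: dict, face_id, seg_len: int):
    """Diverse positive pair (DPP): segments from two DIFFERENT utterances
    linked to the same face track by cross-referencing the audio-visual
    data (a hypothetical mapping built offline from face embeddings)."""
    utt_a, utt_b = random.sample(face_to_utterances[face_id], 2)
    seg_a, _ = sample_ppp(utt_a, seg_len)
    seg_b, _ = sample_ppp(utt_b, seg_len)
    return seg_a, seg_b

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07):
    """NT-Xent-style loss over a batch of embedding pairs; every other
    item in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau            # (B, B) cosine-similarity logits
    targets = torch.arange(z1.shape[0])   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

The design difference the paper exploits is visible in sample_dpp: because its two views come from different utterances of the same face-linked identity, the positive pair carries session and channel variability that same-utterance PPP sampling cannot provide.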
Pages: 1706-1719
Number of pages: 14
Related Papers
50 records in total
  • [21] Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba
    Huang, Lianghua
    Liu, Yu
    Zhou, Xiangzeng
    You, Ansheng
    Li, Ming
    Wang, Bin
    Zhang, Yingya
    Pan, Pan
    Xu, Yinghui
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1148 - 1156
  • [22] JOINT MULTI-MODAL SELF-SUPERVISED PRE-TRAINING IN REMOTE SENSING: APPLICATION TO METHANE SOURCE CLASSIFICATION
    Berg, Paul
    Pham, Minh-Tan
    Courty, Nicolas
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 6624 - 6627
  • [23] Multi-label remote sensing classification with self-supervised gated multi-modal transformers
    Liu, Na
    Yuan, Ye
    Wu, Guodong
    Zhang, Sai
    Leng, Jie
    Wan, Lihong
    FRONTIERS IN COMPUTATIONAL NEUROSCIENCE, 2024, 18
  • [24] A self-supervised building extraction method based on multi-modal remote sensing data
    Qu, Yunhao
    Wang, Chang
    REMOTE SENSING LETTERS, 2025, 16 (01) : 77 - 88
  • [25] Self-supervised 3D Patient Modeling with Multi-modal Attentive Fusion
    Zheng, Meng
    Planche, Benjamin
    Gong, Xuan
    Yang, Fan
    Chen, Terrence
    Wu, Ziyan
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT VII, 2022, 13437 : 115 - 125
  • [26] Self-supervised multi-modal feature fusion for predicting early recurrence of hepatocellular carcinoma
    Wang, Sen
    Zhao, Ying
    Li, Jiayi
    Yi, Zongmin
    Li, Jun
    Zuo, Can
    Yao, Yu
    Liu, Ailian
    COMPUTERIZED MEDICAL IMAGING AND GRAPHICS, 2024, 118
  • [27] Self-Supervised Depth Completion Based on Multi-Modal Spatio-Temporal Consistency
    Zhang, Quan
    Chen, Xiaoyu
    Wang, Xingguo
    Han, Jing
    Zhang, Yi
    Yue, Jiang
    REMOTE SENSING, 2023, 15 (01)
  • [28] Self-supervised learning based multi-modal intra-hour irradiance forecast
    Shan, Shuo
    Dou, Weijin
    Zhang, Kanjian
    Wei, Haikun
    2023 35TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2023, : 2549 - 2553
  • [29] Exploring Self-Supervised Learning for Multi-Modal Remote Sensing Pre-Training via Asymmetric Attention Fusion
    Xu, Guozheng
    Jiang, Xue
    Li, Xiangtai
    Zhang, Ze
    Liu, Xingzhao
    REMOTE SENSING, 2023, 15 (24)
  • [30] MULTI-MODAL SELF-SUPERVISED PRE-TRAINING FOR JOINT OPTIC DISC AND CUP SEGMENTATION IN EYE FUNDUS IMAGES
    Hervella, Alvaro S.
    Ramos, Lucia
    Rouco, Jose
    Novo, Jorge
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 961 - 965