Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Cited by: 3
Authors
Tao, Ruijie [1 ]
Lee, Kong Aik [2 ,3 ]
Das, Rohan Kumar [4 ]
Hautamaki, Ville [5 ,6 ]
Li, Haizhou [7 ,8 ]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[2] Singapore Inst Technol, Singapore 138683, Singapore
[3] ASTAR, Inst Infocomm Res, Singapore 138632, Singapore
[4] Fortemedia, Singapore 138589, Singapore
[5] Natl Univ Singapore, Dept Elect & Comp Engineering, Singapore 119077, Singapore
[6] Univ Eastern Finland, Sch Comp, Joensuu 80101, Finland
[7] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[8] Kriston AI, Xiamen 361026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Self-supervised learning; Supervised learning; Face recognition; Speaker recognition; Task analysis; Neural networks; speaker recognition; diverse positive pairs; multi-modal; progressive clustering; INFORMATION; EXTRACTION;
DOI
10.1109/TASLP.2023.3268568
CLC Number
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, contrastive learning of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EERs) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin, while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
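To make the abstract's key contrast concrete, below is a minimal, hypothetical sketch of the PPP baseline it describes: two segments cropped from the same utterance form a positive pair, and an InfoNCE-style contrastive loss pulls their embeddings together while treating all other utterances in the mini-batch as negatives. The toy encoder architecture, segment length, and temperature are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the "poor-man's positive pairs" (PPP) baseline the
# abstract contrasts with DPP. Encoder, segment length and temperature
# are illustrative assumptions, not the paper's setup.
import torch
import torch.nn.functional as F

class TinySpeakerEncoder(torch.nn.Module):
    """Toy stand-in for the speaker encoder: 1-D convolutions over a raw
    waveform, mean-pooled into a fixed-dimensional embedding."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv1d(1, 64, kernel_size=400, stride=160),  # ~25 ms frames at 16 kHz
            torch.nn.ReLU(),
            torch.nn.Conv1d(64, emb_dim, kernel_size=3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> L2-normalised embedding (batch, emb_dim),
        # independent of the utterance length.
        h = self.conv(wav.unsqueeze(1))            # (batch, emb_dim, frames)
        return F.normalize(h.mean(dim=-1), dim=-1)

def ppp_positive_pair(utt: torch.Tensor, seg_len: int):
    """PPP sampling: crop two random segments from the SAME utterance."""
    s1, s2 = torch.randint(0, utt.shape[-1] - seg_len, (2,)).tolist()
    return utt[..., s1:s1 + seg_len], utt[..., s2:s2 + seg_len]

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: segment pairs from utterance i are positives;
    segments from every other utterance in the batch are negatives."""
    logits = z1 @ z2.t() / tau                     # (batch, batch) cosine similarities
    labels = torch.arange(z1.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    enc = TinySpeakerEncoder()
    batch = torch.randn(8, 32000)                  # 8 fake 2-second utterances at 16 kHz
    pairs = [ppp_positive_pair(u, seg_len=16000) for u in batch]
    z1 = enc(torch.stack([a for a, _ in pairs]))
    z2 = enc(torch.stack([b for _, b in pairs]))
    print("PPP contrastive loss:", contrastive_loss(z1, z2).item())
```

Per the abstract, the proposed DPP strategy keeps this kind of contrastive objective but changes the sampler: instead of two crops of one utterance, the positive for an utterance is drawn from other utterances attributed to the same (pseudo-)speaker, found by cross-referencing speech and face data under progressive clustering.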
Pages: 1706-1719
Number of pages: 14
Related Papers
50 records in total
  • [1] The Effectiveness of Self-supervised Pre-training for Multi-modal Endometriosis Classification
    Butler, David
    Wang, Hu
    Zhang, Yuan
    To, Minh-Son
    Condous, George
    Leonardi, Mathew
    Knox, Steven
    Avery, Jodie
    Hull, M. Louise
    Carneiro, Gustavo
    2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023,
  • [2] Self-supervised multi-modal fusion network for multi-modal thyroid ultrasound image diagnosis
    Xiang, Zhuo
    Zhuo, Qiuluan
    Zhao, Cheng
    Deng, Xiaofei
    Zhu, Ting
    Wang, Tianfu
    Jiang, Wei
    Lei, Baiying
    COMPUTERS IN BIOLOGY AND MEDICINE, 2022, 150
  • [3] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [4] Self-supervised Multi-Modal Video Forgery Attack Detection
    Zhao, Chenhui
    Li, Xiang
    Younes, Rabih
    2023 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE, WCNC, 2023,
  • [5] Self-Supervised Distilled Learning for Multi-modal Misinformation Identification
    Mu, Michael
    Das Bhattacharjee, Sreyasee
    Yuan, Junsong
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 2818 - 2827
  • [6] Self-supervised opinion summarization with multi-modal knowledge graph
    Jin, Lingyun
    Chen, Jingqiang
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2024, 62 (01) : 191 - 208
  • [7] SELF-SUPERVISED LEARNING OF MULTI-MODAL COOPERATION FOR SAR DESPECKLING
    Gaya, Victor
    Dalsasso, Emanuele
    Denis, Loic
    Tupin, Florence
    Pinel-Puyssegur, Beatrice
    Guerin, Cyrielle
    IGARSS 2024-2024 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, IGARSS 2024, 2024, : 2180 - 2183
  • [8] Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised Deep Reinforcement Learning
    Chen, Kaiqi
    Lee, Yong
    Soh, Harold
    2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021), 2021, : 4274 - 4280
  • [9] Self-Supervised Entity Alignment Based on Multi-Modal Contrastive Learning
    Liu, Bo
    Song, Ruoyi
    Xiang, Yuejia
    Du, Junbo
    Ruan, Weijian
    Hu, Jinhui
    IEEE/CAA JOURNAL OF AUTOMATICA SINICA, 2022, 9 (11) : 2031 - 2033