DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Authors
Liu, Alexander H. [1]
Chang, Heng-Jui [1]
Auli, Michael [2]
Hsu, Wei-Ning [2]
Glass, James [1]
Affiliations
[1] MIT, CSAIL, Cambridge, MA 02139 USA
[2] Meta AI, New York, NY USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. Code is available at https://github.com/Alexander-H-Liu/dinosr.
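The three steps named in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the update rules, so the exponential-moving-average (EMA) updates for the teacher and the codebook, the squared-L2 nearest-codeword assignment, the cross-entropy student loss, and all function names and decay values below are assumptions chosen for illustration only.

```python
import math


def ema_update(teacher, student, decay):
    """Assumed self-distillation rule: teacher weights track the student via EMA."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def nearest_code(embedding, codebook):
    """Index of the codeword closest to the embedding (squared L2 distance)."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, code)) for code in codebook]
    return dists.index(min(dists))


def update_codebook(codebook, embeddings, decay):
    """Assumed online clustering step: assign each teacher embedding to its
    nearest codeword, then move every used codeword toward the mean of the
    embeddings assigned to it. Unused codewords are left unchanged."""
    assignments = [nearest_code(z, codebook) for z in embeddings]
    new_codebook = []
    for v, code in enumerate(codebook):
        members = [z for z, a in zip(embeddings, assignments) if a == v]
        if not members:
            new_codebook.append(list(code))
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        new_codebook.append([decay * c + (1.0 - decay) * m for c, m in zip(code, mean)])
    return new_codebook, assignments


def cross_entropy(logits, target):
    """Student loss for one frame: negative log-probability of the target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


# Toy data: three 2-dimensional "teacher embeddings" and a 2-entry codebook.
teacher_frames = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0]]
codebook = [[0.0, 0.0], [1.0, 1.0]]

codebook, targets = update_codebook(codebook, teacher_frames, decay=0.9)

# The student would be trained to predict `targets` at masked positions;
# here hypothetical student logits over the two codewords stand in for its output.
student_logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
loss = sum(cross_entropy(l, t) for l, t in zip(student_logits, targets)) / len(targets)
print(targets, loss)
```

In this sketch the cluster indices play the role of the "machine-discovered phone inventory": the student never sees the teacher embeddings directly, only the index of the codeword each frame was assigned to.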
Pages: 17