Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Cited by: 0
Authors
Niizumi, Daisuke [1]
Takeuchi, Daiki [1]
Ohishi, Yasunori [1]
Harada, Noboru [1]
Kashino, Kunio [1]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2023 | 2023
Keywords
speech representation learning; general-purpose audio representation; denoising; distillation; specialization;
DOI
10.21437/Interspeech.2023-221
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Self-supervised general-purpose audio representations have demonstrated high performance in a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if they can be specialized for that application during pre-training. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, to learn from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.
Pages: 1294 - 1298
Number of pages: 5