Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Cited by: 0

Authors:

Niizumi, Daisuke [1]

Takeuchi, Daiki [1]

Ohishi, Yasunori [1]

Harada, Noboru [1]

Kashino, Kunio [1]

Affiliations:

[1] NTT Corp, Tokyo, Japan

Source:

INTERSPEECH 2023 | 2023

Keywords:

speech representation learning; general-purpose audio representation; denoising; distillation; specialization

DOI:

10.21437/Interspeech.2023-221

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification code:

070206; 082403

Abstract:

Self-supervised learning general-purpose audio representations have demonstrated high performance in a variety of tasks. Although they can be optimized for application by fine-tuning, even higher performance can be expected if they can be specialized to pre-train for an application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, to learn from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.


Pages: 1294-1298

Page count: 5
