Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Cited by: 0
Authors
Niizumi, Daisuke [1 ]
Takeuchi, Daiki [1 ]
Ohishi, Yasunori [1 ]
Harada, Noboru [1 ]
Kashino, Kunio [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2023 | 2023
Keywords
speech representation learning; general-purpose audio representation; denoising; distillation; specialization;
DOI
10.21437/Interspeech.2023-221
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Self-supervised general-purpose audio representations have demonstrated high performance across a variety of tasks. Although they can be adapted to an application by fine-tuning, even higher performance can be expected if the pre-training itself can be specialized for that application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, for learning from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.
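As a rough illustration of the joint objective described in the abstract, the sketch below combines a masked-prediction loss with a denoising-distillation loss. All shapes, the cosine loss, the `np.sign` stand-in for "clustered" teacher features, and the weighting term `lambda_` are illustrative assumptions, not the actual M2D-S formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_loss(pred, target):
    # Mean (1 - cosine similarity) over patches; a common feature-matching
    # objective, used here only as a placeholder for the paper's loss.
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))

# Illustrative shapes: 10 spectrogram patches, 8-dim features.
n_patches, dim = 10, 8
mask = rng.random(n_patches) < 0.6            # M2D masks a large patch ratio

teacher_feats = rng.standard_normal((n_patches, dim))   # stand-in teacher output
student_pred = teacher_feats + 0.1 * rng.standard_normal((n_patches, dim))

# (1) Masked-prediction loss: predict features of the masked patches only.
l_m2d = cosine_loss(student_pred[mask], teacher_feats[mask])

# (2) Denoising-distillation loss: from a *noisy* input, match fine-grained
# clustered features of the clean signal (here: a toy sign-quantization).
noisy_pred = teacher_feats + 0.5 * rng.standard_normal((n_patches, dim))
l_distill = cosine_loss(noisy_pred, np.sign(teacher_feats))

# Joint objective with an assumed weighting term.
lambda_ = 1.0
l_total = l_m2d + lambda_ * l_distill
print(l_m2d, l_distill, l_total)
```

The intent is only to show the structure of the training signal: one term drives masked prediction against the teacher, the other drives noise-robust matching of discretized clean-speech features, and the two are summed into a single objective.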
Pages: 1294-1298
Page count: 5
Related Papers
50 items
  • [31] A comprehensive noise robust speech parameterization algorithm using wavelet packet decomposition-based denoising and speech feature representation techniques
    Kotnik, Bojan
    Kacic, Zdravko
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2007, 2007 (1)
  • [33] EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
    Liu, Haiyang
    Zhu, Zihao
    Becherini, Giorgio
    Peng, Yichen
    Su, Mingyang
    Zhou, You
    Zhe, Xuefei
    Iwamoto, Naoya
    Zheng, Bo
    Black, Michael J.
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 1144 - 1154
  • [34] Development of general-purpose large-scale data visualization system using implicit function representation
    Shuji K.
    Mitsume N.
    Morita N.
    Transactions of the Japan Society for Computational Engineering and Science, 2024, 2024 (01)
  • [35] Bio-Inspired Sparse Representation of Speech and Audio Using Psychoacoustic Adaptive Matching Pursuit
    Petrovsky, Alexey
    Herasimovich, Vadzim
    Petrovsky, Alexander
    SPEECH AND COMPUTER, 2016, 9811 : 156 - 164
  • [36] A SIMULATION STUDY OF THE ASTER SENSOR USING A VERSATILE GENERAL-PURPOSE RIGID SENSOR MODELING SYSTEM
    O'Neill, M. A.
    Dowman, I. J.
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 1993, 14 (03) : 565 - 582
  • [37] Modeling trajectories of human speech articulators using general Tau theory
    Elie, Benjamin
    Lee, David N.
    Turk, Alice
    SPEECH COMMUNICATION, 2023, 151 : 24 - 38
  • [38] Mi-Go: tool which uses YouTube as data source for evaluating general-purpose speech recognition machine learning models
    Wojnar, Tomasz
    Hryszko, Jaroslaw
    Roman, Adam
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2024, 2024 (01)
  • [39] IMAGE AND AUDIO-SPEECH DENOISING BASED ON HIGHER-ORDER STATISTICAL MODELING OF WAVELET COEFFICIENTS AND LOCAL VARIANCE ESTIMATION
    Kittisuwan, Pichid
    Chanwimaluan, Thitiporn
    Marukatat, Sanparith
    Asdornwised, Widhyakorn
    INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2010, 8 (06) : 987 - 1017
  • [40] Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
    Deichler, Anna
    Mehta, Shivam
    Alexanderson, Simon
    Beskow, Jonas
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, : 755 - 762