Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Cited by: 0
Authors
Niizumi, Daisuke [1 ]
Takeuchi, Daiki [1 ]
Ohishi, Yasunori [1 ]
Harada, Noboru [1 ]
Kashino, Kunio [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2023 | 2023
Keywords
speech representation learning; general-purpose audio representation; denoising; distillation; specialization;
DOI
10.21437/Interspeech.2023-221
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Self-supervised general-purpose audio representations have demonstrated high performance across a variety of tasks. Although they can be adapted to an application by fine-tuning, even higher performance can be expected if the pre-training itself can be specialized for that application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, for learning from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.
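As a rough illustration of the joint objective described in the abstract, the sketch below combines a masked-prediction loss with a denoising-distillation loss. All shapes, the cosine loss, the `np.sign` stand-in for "clustered" teacher features, and the weighting term `lambda_` are illustrative assumptions, not the actual M2D-S formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_loss(pred, target):
    # Mean (1 - cosine similarity) over patches; a common feature-matching
    # objective, used here only as a placeholder for the paper's loss.
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))

# Illustrative shapes: 10 spectrogram patches, 8-dim features.
n_patches, dim = 10, 8
mask = rng.random(n_patches) < 0.6            # M2D masks a large patch ratio

teacher_feats = rng.standard_normal((n_patches, dim))   # stand-in teacher output
student_pred = teacher_feats + 0.1 * rng.standard_normal((n_patches, dim))

# (1) Masked-prediction loss: predict features of the masked patches only.
l_m2d = cosine_loss(student_pred[mask], teacher_feats[mask])

# (2) Denoising-distillation loss: from a *noisy* input, match fine-grained
# clustered features of the clean signal (here: a toy sign-quantization).
noisy_pred = teacher_feats + 0.5 * rng.standard_normal((n_patches, dim))
l_distill = cosine_loss(noisy_pred, np.sign(teacher_feats))

# Joint objective with an assumed weighting term.
lambda_ = 1.0
l_total = l_m2d + lambda_ * l_distill
print(l_m2d, l_distill, l_total)
```

The intent is only to show the structure of the training signal: one term drives masked prediction against the teacher, the other drives noise-robust matching of discretized clean-speech features, and the two are summed into a single objective.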
Pages: 1294-1298
Page count: 5
Related Papers
50 items
  • [31] A comprehensive noise robust speech parameterization algorithm using wavelet packet decomposition-based denoising and speech feature representation techniques
    Kotnik, Bojan
    Kacic, Zdravko
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2007, 2007 (1)
  • [33] EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
    Liu, Haiyang
    Zhu, Zihao
    Becherini, Giorgio
    Peng, Yichen
    Su, Mingyang
    Zhou, You
    Zhe, Xuefei
    Iwamoto, Naoya
    Zheng, Bo
    Black, Michael J.
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 1144 - 1154
  • [34] Development of general-purpose large-scale data visualization system using implicit function representation
    Shuji K.
    Mitsume N.
    Morita N.
    Transactions of the Japan Society for Computational Engineering and Science, 2024, 2024 (01)
  • [35] Bio-Inspired Sparse Representation of Speech and Audio Using Psychoacoustic Adaptive Matching Pursuit
    Petrovsky, Alexey
    Herasimovich, Vadzim
    Petrovsky, Alexander
    SPEECH AND COMPUTER, 2016, 9811 : 156 - 164
  • [36] A SIMULATION STUDY OF THE ASTER SENSOR USING A VERSATILE GENERAL-PURPOSE RIGID SENSOR MODELING SYSTEM
    O'Neill, M. A.
    Dowman, I. J.
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 1993, 14 (03) : 565 - 582
  • [37] Modeling trajectories of human speech articulators using general Tau theory
    Elie, Benjamin
    Lee, David N.
    Turk, Alice
    SPEECH COMMUNICATION, 2023, 151 : 24 - 38
  • [38] Mi-Go: tool which uses YouTube as data source for evaluating general-purpose speech recognition machine learning models
    Wojnar, Tomasz
    Hryszko, Jaroslaw
    Roman, Adam
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2024, 2024 (01)
  • [39] IMAGE AND AUDIO-SPEECH DENOISING BASED ON HIGHER-ORDER STATISTICAL MODELING OF WAVELET COEFFICIENTS AND LOCAL VARIANCE ESTIMATION
    Kittisuwan, Pichid
    Chanwimaluan, Thitiporn
    Marukatat, Sanparith
    Asdornwised, Widhyakorn
    INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2010, 8 (06) : 987 - 1017
  • [40] Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
    Deichler, Anna
    Mehta, Shivam
    Alexanderson, Simon
    Beskow, Jonas
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, : 755 - 762