Aggregation Strategies of Wav2vec 2.0 Embeddings for Computational Paralinguistic Tasks

Cited by: 0
Authors
Vetrab, Mercedes [1]
Gosztolya, Gabor [1, 2]
Affiliations
[1] Univ Szeged, Inst Informat, Szeged, Hungary
[2] ELKH SZTE Res Grp Artificial Intelligence, Szeged, Hungary
Source
SPEECH AND COMPUTER, SPECOM 2023, PT I | 2023 / Vol. 14338
Keywords
Paralinguistics; Wav2vec 2.0; Embeddings; Aggregation; Classification
DOI
10.1007/978-3-031-48309-7_7
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Throughout the history of computational paralinguistics, numerous feature extraction, preprocessing and classification techniques have been used. One of the important challenges in this subfield of speech technology is handling utterances of different durations. Since standard speech processing features (such as filter banks or DNN embeddings) are typically frame-level ones and we would like to classify whole utterances, a set of frame-level features has to be converted into fixed-size utterance-level features. The choice of this aggregation method is often overlooked, and simple functions like mean and/or standard deviation are used without solid experimental support. In this study we take wav2vec 2.0 deep embeddings and aggregate them with 11 different functions. We sought to obtain a subset of potentially optimal aggregation functions, because there are no general rules yet that can be applied universally across subtopics. Besides testing both standard and non-traditional aggregation strategies individually, we also combined them to improve the classification performance. By using multiple aggregation functions, we were able to achieve significant improvements on three public paralinguistic corpora.
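The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes frame-level embeddings are given as a plain list of equal-length vectors. The names `AGGREGATORS` and `aggregate` are hypothetical, and the four functions shown are illustrative only: the abstract names mean and standard deviation but does not list the 11 functions the authors tested.

```python
from statistics import fmean, pstdev

# Hypothetical aggregation functions. Only mean and standard deviation are
# named in the abstract; min and max are illustrative additions.
AGGREGATORS = {
    "mean": fmean,
    "std": pstdev,
    "min": min,
    "max": max,
}

def aggregate(frames, names=("mean", "std")):
    """Map a (num_frames x dim) list of frame vectors to one fixed-size vector.

    Each selected function is applied per embedding dimension, and the
    results are concatenated, so the output length is len(names) * dim
    regardless of how many frames the utterance has.
    """
    if not frames:
        raise ValueError("need at least one frame")
    columns = list(zip(*frames))  # one tuple of values per embedding dimension
    features = []
    for name in names:
        func = AGGREGATORS[name]
        features.extend(float(func(col)) for col in columns)
    return features

# Two toy "utterances" with different numbers of frames, 2-dimensional frames.
short_utt = [[0.0, 1.0], [2.0, 3.0]]
long_utt = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 6.0]]

vec_short = aggregate(short_utt)
vec_long = aggregate(long_utt)
vec_combined = aggregate(long_utt, names=("mean", "std", "min", "max"))
```

Both utterances map to vectors of the same length even though their frame counts differ, which is what allows a standard classifier to consume them. Combining several functions, as the abstract describes, simply lengthens the concatenated vector.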
Pages: 79-93
Number of pages: 15