Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance

Cited by: 0
Authors
Perezhohin, Yuriy [1 ,2 ]
Santos, Tiago [1 ,2 ]
Costa, Victor [1 ,2 ]
Peres, Fernando [1 ]
Castelli, Mauro [2 ]
Affiliations
[1] MyNorth AI Res, P-2780125 Oeiras, Portugal
[2] Univ NOVA Lisboa, NOVA Informat Management Sch NOVA IMS, Campus Campolide, P-1070312 Lisbon, Portugal
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Hidden Markov models; Feature extraction; Filtering; Data models; Synthetic data; Training; Contrastive learning; Accuracy; Adaptation models; Transformers; Automatic speech recognition; Text to speech; contrastive learning; data augmentation; embeddings; synthetic data filtering; text-to-speech; REPRESENTATIONS
DOI
10.1109/ACCESS.2024.3482970
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples whose audio and text do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering, especially on non-normalized text. This work highlights the importance of adjusting synthetic data augmentation and filtering to specific model architectures and target domains. The proposed method is robust and adaptable, enhancing ASR performance across diverse language settings. We have open-sourced the entire work, which includes 140 hours of synthetically generated Portuguese speech, as well as the pipeline and parameter settings used to create these samples. Additionally, we provide the fine-tuned Whisper models and the code required to reproduce this research. Our code will be available at https://github.com/my-north-ai/semantic_audio_filtering.
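The filtering step described in the abstract (scoring how well each synthetic audio clip aligns with its transcript, then discarding poorly aligned pairs) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes audio and text embeddings have already been produced by some contrastive model, and the function name `filter_by_alignment` and the threshold value are hypothetical.

```python
import numpy as np

def filter_by_alignment(audio_emb, text_emb, threshold=0.5):
    """Keep synthetic (audio, text) pairs whose embeddings agree semantically.

    audio_emb, text_emb: (n_samples, dim) arrays of paired embeddings,
    assumed to come from a contrastive audio-text model.
    Returns a boolean keep-mask and the per-pair cosine similarities.
    """
    # L2-normalize each row so the dot product of a pair is its cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = np.sum(a * t, axis=1)          # cosine similarity of each aligned pair
    keep = sims >= threshold              # drop low-similarity (low-quality) samples
    return keep, sims

# Toy example: the first pair is perfectly aligned, the second is orthogonal
audio = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([[1.0, 0.0], [1.0, 0.0]])
keep, sims = filter_by_alignment(audio, text, threshold=0.5)
# keep → [True, False]: only the well-aligned pair survives filtering
```

In practice the threshold would be tuned per model and dataset, consistent with the paper's finding that larger models benefit from more aggressive filtering.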
Pages: 155136-155150
Page count: 15
Related Papers
50 items in total
  • [1] Automatic Speech Recognition Models: A Characteristic and Performance Review
    Patil, U. G.
    Shirbahadurkar, S. D.
    Paithane, A. N.
    2016 INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION (ICCUBEA), 2016,
  • [2] SEMANTIC WORD EMBEDDING NEURAL NETWORK LANGUAGE MODELS FOR AUTOMATIC SPEECH RECOGNITION
    Audhkhasi, Kartik
    Sethy, Abhinav
    Ramabhadran, Bhuvana
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5995 - 5999
  • [3] Morphological filtering of spectrograms for automatic speech recognition
    Liu, WM
    Bastante, VJR
    Rodriguez, FR
    Evans, NWD
    Mason, JSD
    PROCEEDINGS OF THE FOURTH IASTED INTERNATIONAL CONFERENCE ON VISUALIZATION, IMAGING, AND IMAGE PROCESSING, 2004, : 546 - 549
  • [4] The effect of speech and audio compression on speech recognition performance
    Besacier, L
    Bergamini, C
    Vaufreydaz, D
    Castelli, E
    2001 IEEE FOURTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2001, : 301 - 306
  • [5] Investigation of Automatic Speech Recognition Performance and Mean Opinion Scores for Different Standard Speech and Audio Codecs
    Ramana, A. V.
    Parayitam, Laxminarayana
    Pala, Mythili Sharan
    IETE JOURNAL OF RESEARCH, 2012, 58 (02) : 121 - 129
  • [6] REVERBERATION, MASKING, FILTERING, AND LEVEL EFFECTS ON SPEECH RECOGNITION PERFORMANCE
    LOVEN, FC
    COLLINS, MJ
    JOURNAL OF SPEECH AND HEARING RESEARCH, 1988, 31 (04): : 681 - 695
  • [7] Synthesising Audio Adversarial Examples for Automatic Speech Recognition
    Qu, Xinghua
    Wei, Pengfei
    Gao, Mingyong
    Sun, Zhu
    Ong, Yew-Soon
    Ma, Zejun
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 1430 - 1440
  • [8] Automatic speech recognition using audio visual cues
    Yashwanth, H
    Mahendrakar, H
    David, S
    PROCEEDINGS OF THE IEEE INDICON 2004, 2004, : 166 - 169
  • [9] Automatic Speech Recognition System for Lithuanian Broadcast Audio
    Alumae, Tanel
    Tilk, Ottokar
    HUMAN LANGUAGE TECHNOLOGIES - THE BALTIC PERSPECTIVE, 2016, 289 : 39 - 45
  • [10] AUTOMATIC RECOGNITION OF SPEECH WITHOUT ANY AUDIO INFORMATION
    Heracleous, Panikos
    Hagita, Norihiro
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 2392 - 2395