Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Cited by: 5
Authors
Oneata, Dan [1 ]
Cucu, Horia [1 ]
Affiliation
[1] Univ Politehn Bucuresti, Speech & Dialogue Res Lab, Bucharest, Romania
Keywords
MODEL;
DOI
10.1109/CVPRW56347.2022.00504
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g. by fine-tuning pretrained image recognition networks, significantly less attention has been paid to their counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to the ones used for the visual encoder, namely transferring representations and data augmentation. First, we show that starting from a pretrained ASR model significantly improves on the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains from including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces the previously used word masking and comes with the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
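To illustrate the kind of speech data augmentation the abstract refers to, the sketch below applies SpecAugment-style frequency and time masking to a log-mel spectrogram. This is a minimal, hypothetical example: the abstract does not name the exact augmentation used in the paper, and the function name, mask widths, and feature shape here are assumptions made purely for illustration.

    import numpy as np

    def spec_augment(log_mel, num_freq_masks=2, freq_mask_width=15,
                     num_time_masks=2, time_mask_width=35, rng=None):
        """Mask random frequency bands and time spans of a log-mel
        spectrogram with shape (num_mel_bins, num_frames).
        Illustrative sketch only; parameters are assumed defaults."""
        rng = rng or np.random.default_rng()
        augmented = log_mel.copy()
        num_bins, num_frames = augmented.shape

        # Zero out randomly placed bands of mel-frequency channels.
        for _ in range(num_freq_masks):
            width = int(rng.integers(0, freq_mask_width + 1))
            start = int(rng.integers(0, max(1, num_bins - width)))
            augmented[start:start + width, :] = 0.0

        # Zero out randomly placed spans of time frames.
        for _ in range(num_time_masks):
            width = int(rng.integers(0, time_mask_width + 1))
            start = int(rng.integers(0, max(1, num_frames - width)))
            augmented[:, start:start + width] = 0.0

        return augmented

    # Usage: augment a fake 80-bin, 300-frame log-mel spectrogram.
    features = np.random.randn(80, 300).astype(np.float32)
    masked = spec_augment(features)
    print(masked.shape)  # (80, 300)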
Pages: 4578-4587
Number of pages: 10
Related Papers
50 records in total
  • [41] Improving Speech Synthesis by Automatic Speech Recognition and Speech Discriminator
    Huang, Li-Yu
    Chen, Chia-Ping
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2024, 40 (01) : 189 - 200
  • [42] Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech
    Wang, Shijun
    Gudnason, Jon
    Borth, Damian
    INTERSPEECH 2023, 2023, : 351 - 355
  • [43] Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition
    Jin, Zengrui
    Geng, Mengzhe
    Deng, Jiajun
    Wang, Tianzi
    Hu, Shujie
    Li, Guinan
    Liu, Xunying
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 413 - 429
  • [44] Data Augmentation Techniques for Speech Emotion Recognition and Deep Learning
    Antonio Nicolas, Jose
    de Lope, Javier
    Grana, Manuel
    BIO-INSPIRED SYSTEMS AND APPLICATIONS: FROM ROBOTICS TO AMBIENT INTELLIGENCE, PT II, 2022, 13259 : 279 - 288
  • [45] A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
    Tu, Zhongwen
    Liu, Bin
    Zhao, Wei
    Yan, Raoxin
    Zou, Yang
    APPLIED SCIENCES-BASEL, 2023, 13 (07)
  • [46] SPEECH EMOTION RECOGNITION WITH MULTISCALE AREA ATTENTION AND DATA AUGMENTATION
    Xu, Mingke
    Zhang, Fan
    Cui, Xiaodong
    Zhang, Wei
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6319 - 6323
  • [47] SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
    Park, Daniel S.
    Chan, William
    Zhang, Yu
    Chiu, Chung-Cheng
    Zoph, Barret
    Cubuk, Ekin D.
    Le, Quoc V.
    INTERSPEECH 2019, 2019, : 2613 - 2617
  • [48] Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation
    Bartelds, Martijn
    San, Nay
    McDonnell, Bradley
    Jurafsky, Dan
    Wieling, Martijn
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 715 - 729
  • [49] A Survey of the Effects of Data Augmentation for Automatic Speech Recognition Systems
    Manuel Ramirez, Jose
    Montalvo, Ana
    Ramon Calvo, Jose
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS (CIARP 2019), 2019, 11896 : 669 - 678
  • [50] Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition
    Ueno, Sei
    Lee, Akinobu
    Kawahara, Tatsuya
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3924 - 3933