Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Cited by: 5
Authors
Oneata, Dan [1]
Cucu, Horia [1]
Affiliations
[1] Univ Politehn Bucuresti, Speech & Dialogue Res Lab, Bucharest, Romania
Keywords
MODEL
DOI
10.1109/CVPRW56347.2022.00504
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g. by fine-tuning pretrained image recognition networks, significantly less attention has been paid to their counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to those used for the visual encoder, namely transferring representations and data augmentation. First, we show that starting from a pretrained ASR significantly improves over the previous state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains from including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces the previously used word masking and has the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
Pages: 4578 - 4587
Number of pages: 10
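
To make the speech augmentation idea from the abstract concrete, the sketch below applies SpecAugment-style frequency and time masking to a log-mel spectrogram in Python/NumPy. The abstract does not name a specific augmentation method, so the choice of SpecAugment-style masking, the masking widths, and the helper name spec_augment are illustrative assumptions rather than details taken from the paper.

import numpy as np

def spec_augment(log_mel, num_freq_masks=2, freq_width=15,
                 num_time_masks=2, time_width=50, rng=None):
    # Illustrative SpecAugment-style masking (assumed, not taken from the paper):
    # zero out random bands of mel channels and random spans of time frames in a
    # log-mel spectrogram of shape (num_mel_bins, num_frames).
    rng = rng if rng is not None else np.random.default_rng()
    aug = log_mel.copy()
    n_mels, n_frames = aug.shape

    # Frequency masking: hide a few randomly chosen mel-channel bands.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_width + 1))
        start = int(rng.integers(0, max(1, n_mels - width)))
        aug[start:start + width, :] = 0.0

    # Time masking: hide a few randomly chosen spans of frames, removing
    # acoustic evidence so a multimodal model has to lean on the visual stream.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width)))
        aug[:, start:start + width] = 0.0

    return aug

# Usage example on a dummy 80-bin, 300-frame log-mel spectrogram.
features = np.random.randn(80, 300).astype(np.float32)
augmented = spec_augment(features)

In the setting described by the abstract, masking parts of the audio in this way plays the role previously filled by word masking: with some acoustic evidence removed, the multimodal system is encouraged to attend to the visual stimuli.
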
Related Papers
50 records in total
  • [21] You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
    Laptev, Aleksandr
    Korostik, Roman
    Svischev, Aleksey
    Andrusenko, Andrei
    Medennikov, Ivan
    Rybin, Sergey
    2020 13TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI 2020), 2020, : 439 - 444
  • [22] SPEECH AUGMENTATION USING WAVENET IN SPEECH RECOGNITION
    Wang, Jisung
    Kim, Sangki
    Lee, Yeha
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6770 - 6774
  • [23] IMPROVING SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION TRAINING WITH ON-THE-FLY DATA AUGMENTATION
    Nguyen, Thai-Son
    Stuker, Sebastian
    Niehues, Jan
    Waibel, Alex
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7689 - 7693
  • [24] Improving Children's Speech Recognition through Out-of-Domain Data Augmentation
    Fainberg, Joachim
    Bell, Peter
    Lincoln, Mike
    Renals, Steve
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1598 - 1602
  • [25] Data Augmentation for Improving Explainability of Hate Speech Detection
    Ansari, Gunjan
    Kaur, Parmeet
    Saxena, Chandni
    Arabian Journal for Science and Engineering, 2024, 49 : 3609 - 3621
  • [26] Data Augmentation for Improving Explainability of Hate Speech Detection
    Ansari, Gunjan
    Kaur, Parmeet
    Saxena, Chandni
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2024, 49 (03) : 3609 - 3621
  • [27] Adaptive data augmentation for Mandarin automatic speech recognition
    Ding, Kai
    Li, Ruixuan
    Xu, Yuelin
    Du, Xingyue
    Deng, Bin
    APPLIED INTELLIGENCE, 2024, 54 (07) : 5674 - 5687
  • [28] Adversarial Data Augmentation Network for Speech Emotion Recognition
    Yi, Lu
    Mak, Man-Wai
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 529 - 534
  • [29] Investigation of Data Augmentation Techniques for Disordered Speech Recognition
    Geng, Mengzhe
    Xie, Xurong
    Liu, Shansong
    Yu, Jianwei
    Hu, Shoukang
    Liu, Xunying
    Meng, Helen
    INTERSPEECH 2020, 2020, : 696 - 700
  • [30] Data Augmentation using GANs for Speech Emotion Recognition
    Chatziagapi, Aggelina
    Paraskevopoulos, Georgios
    Sgouropoulos, Dimitris
    Pantazopoulos, Georgios
    Nikandrou, Malvina
    Giannakopoulos, Theodoros
    Katsamanis, Athanasios
    Potamianos, Alexandros
    Narayanan, Shrikanth
    INTERSPEECH 2019, 2019, : 171 - 175