Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Cited by: 5
Authors
Oneata, Dan [1 ]
Cucu, Horia [1 ]
Affiliations
[1] Univ Politehn Bucuresti, Speech & Dialogue Res Lab, Bucharest, Romania
Keywords
MODEL
DOI
10.1109/CVPRW56347.2022.00504
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g., by fine-tuning pretrained image recognition networks, significantly less attention has been paid to their counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to the ones used for the visual encoder, namely transferring representations and data augmentation. First, we show that starting from a pretrained ASR model significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains from including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces the previously used word masking and comes with the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
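The abstract does not specify which speech augmentation is used, so the sketch below is only an illustration of the general idea: SpecAugment-style time and frequency masking applied to a log-mel spectrogram, which removes parts of the audio evidence and can thereby push a multimodal model to rely more on the visual stream. The function and parameter names are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: generic SpecAugment-style masking of a
# log-mel spectrogram, a plausible (assumed) instance of the "speech
# data augmentation" mentioned in the abstract.
import numpy as np

def mask_spectrogram(spec, num_time_masks=2, max_time_width=30,
                     num_freq_masks=2, max_freq_width=8, rng=None):
    """Zero out random time and frequency bands of `spec` (freq_bins x frames)."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(num_time_masks):
        w = int(rng.integers(0, max_time_width + 1))   # mask width in frames
        t0 = int(rng.integers(0, max(1, n_time - w)))  # random start frame
        out[:, t0:t0 + w] = 0.0
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, max_freq_width + 1))   # mask width in mel bins
        f0 = int(rng.integers(0, max(1, n_freq - w)))  # random start bin
        out[f0:f0 + w, :] = 0.0
    return out

# Usage example on a dummy 80-bin, 500-frame log-mel spectrogram.
augmented = mask_spectrogram(np.random.randn(80, 500))
```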
Pages: 4578 - 4587
Number of pages: 10