Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Cited by: 0
Authors
Lin, Yist Y. [1]
Han, Tao [1]
Xu, Haihua [1]
Van Tung Pham [1]
Khassanov, Yerbolat [1]
Chong, Tze Yuang [1]
He, Yi [1]
Lu, Lu [1]
Ma, Zejun [1]
Affiliations
[1] ByteDance, Beijing, People's Republic of China
Source
Keywords
random utterance concatenation; data augmentation; short video; end-to-end; speech recognition
DOI
10.21437/Interspeech.2023-1272
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances for short-video spontaneous speech tend to be much shorter (about 3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (about 10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without degrading performance on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to varying utterance lengths.
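The abstract does not give implementation details, so the following is only a minimal sketch of what on-the-fly random utterance concatenation could look like, not the authors' method. The `Utterance` type, the function name `random_utterance_concat`, the `max_concat` and `max_samples` parameters, the greedy grouping policy, and joining transcripts with a space are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    samples: List[float]  # mono waveform samples (hypothetical representation)
    text: str             # reference transcript


def random_utterance_concat(
    batch: List[Utterance],
    max_concat: int = 3,            # assumed upper bound on utterances per group
    max_samples: int = 16000 * 10,  # assumed cap: 10 s at 16 kHz
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Merge randomly ordered utterances into longer training examples."""
    rng = rng or random.Random()
    pool = list(batch)
    rng.shuffle(pool)  # partners change every time a batch is drawn

    out: List[Utterance] = []
    i = 0
    while i < len(pool):
        group_size = rng.randint(1, max_concat)  # 1 keeps the utterance as is
        samples = list(pool[i].samples)
        texts = [pool[i].text]
        i += 1
        used = 1
        while used < group_size and i < len(pool):
            nxt = pool[i]
            if len(samples) + len(nxt.samples) > max_samples:
                break  # leave nxt for the next group instead of exceeding the cap
            samples.extend(nxt.samples)
            texts.append(nxt.text)
            i += 1
            used += 1
        # Assumption: transcripts are joined with a space, which suits
        # whitespace-delimited languages only.
        out.append(Utterance(samples, " ".join(texts)))
    return out
```

Calling such a function on each mini-batch inside the data loader, rather than once before training, is what would make the augmentation on-the-fly: the same short utterance gets different neighbours in every epoch, and no concatenated audio has to be stored.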
Pages: 904-908
Number of pages: 5