Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Cited by: 0
Authors
Lin, Yist Y. [1 ]
Han, Tao [1 ]
Xu, Haihua [1 ]
Van Tung Pham [1 ]
Khassanov, Yerbolat [1 ]
Chong, Tze Yuang [1 ]
He, Yi [1 ]
Lu, Lu [1 ]
Ma, Zejun [1 ]
Affiliations
[1] ByteDance, Beijing, People's Republic of China
Source
Keywords
random utterance concatenation; data augmentation; short video; end-to-end; speech recognition;
DOI
10.21437/Interspeech.2023-1272
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances of short-video spontaneous speech tend to be much shorter (~3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without degrading performance on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improves robustness to varying utterance lengths.
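The abstract does not give the details of the RUC procedure, so the following is only a minimal sketch of what on-the-fly random utterance concatenation could look like, not the authors' implementation. The `Utterance` type, the function name `random_utterance_concat`, the parameters `max_concat` and `max_samples`, the 16 kHz sampling rate behind the default cap, and joining transcripts with a space are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    samples: List[float]  # mono waveform samples (hypothetical representation)
    text: str             # human transcript


def random_utterance_concat(
    pool: List[Utterance],
    max_concat: int = 3,          # assumed upper bound on utterances per example
    max_samples: int = 160_000,   # assumed cap: about 10 s at an assumed 16 kHz
    rng: Optional[random.Random] = None,
) -> Utterance:
    """Build one training example by joining randomly chosen utterances.

    Intended to be called each time an example is drawn, so the
    concatenations differ from epoch to epoch (on-the-fly augmentation).
    """
    if rng is None:
        rng = random.Random()
    # Draw how many utterances to join; 1 keeps some original short examples.
    k = rng.randint(1, max_concat)
    picked = rng.sample(pool, min(k, len(pool)))

    # The first utterance is always kept, even if it alone exceeds the cap.
    samples = list(picked[0].samples)
    texts = [picked[0].text]
    for utt in picked[1:]:
        # Stop before the concatenated audio would exceed the length cap.
        if len(samples) + len(utt.samples) > max_samples:
            break
        samples.extend(utt.samples)
        texts.append(utt.text)

    # Joining transcripts with a space is an assumption that suits
    # whitespace-delimited languages only.
    return Utterance(samples=samples, text=" ".join(texts))
```

In a training loop, a function like this would stand in for the step that fetches a single utterance, so the model sees a mix of short original utterances and longer concatenated ones, closer to the roughly 10-second test segments described above.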
Pages: 904-908
Page count: 5
Related papers
50 in total
  • [1] Enhancing Children's Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach
    Aziz, Shahid
    Shahnawazuddin, Syed
    SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 380 - 394
  • [2] Improving Short Utterance Speaker Recognition by Modeling Speech Unit Classes
    Li, Lantian
    Wang, Dong
    Zhang, Chenhao
    Zheng, Thomas Fang
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (06) : 1129 - 1139
  • [3] Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
    Oneata, Dan
    Cucu, Horia
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4578 - 4587
  • [4] Improving Utterance Rewriter Based on MMI and Text Data Augmentation
    Yang, Lina
    Lin, Hai
    Li, Wei
    Meng, Zuqiang
    Wang, Patrick Shen-Pei
    Li, Xichun
    Luo, Huiwu
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2022, 36 (04)
  • [5] Speech Unit Category based Short Utterance Speaker Recognition
    Fatima, Nakhat
    Wu, Xiaojun
    Zheng, Thomas Fang
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2012, 9 (04) : 1407 - 1430
  • [6] Short Utterance-based Video Aided Speaker Recognition
    Larcher, Anthony
    Bonastre, Jean-Francois
    Mason, John S. D.
    2008 IEEE 10TH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, VOLS 1 AND 2, 2008, : 901 - +
  • [7] DATA AUGMENTATION BASED ON VOWEL STRETCH FOR IMPROVING CHILDREN'S SPEECH RECOGNITION
    Nagano, Tohru
    Fukuda, Takashi
    Suzuki, Masayuki
    Kurata, Gakuto
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 502 - 508
  • [8] Improving Speech Emotion Recognition With Adversarial Data Augmentation Network
    Yi, Lu
    Mak, Man-Wai
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (01) : 172 - 184
  • [9] A Data Augmentation Approach for Improving the Performance of Speech Emotion Recognition
    Paraskevopoulou, Georgia
    Spyrou, Evaggelos
    Perantonis, Stavros
    SIGMAP: PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND MULTIMEDIA APPLICATIONS, 2022, : 61 - 69
  • [10] Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis
    Cong-Thanh Do
    Imai, Shuhei
    Doddipatla, Rama
    Hain, Thomas
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 136 - 140