Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Citations: 0
Authors:
Lin, Yist Y. [1 ]
Han, Tao [1 ]
Xu, Haihua [1 ]
Van Tung Pham [1 ]
Khassanov, Yerbolat [1 ]
Chong, Tze Yuang [1 ]
He, Yi [1 ]
Lu, Lu [1 ]
Ma, Zejun [1 ]
Affiliations:
[1] ByteDance, Beijing, Peoples R China
Source:
INTERSPEECH 2023
Keywords:
random utterance concatenation; data augmentation; short video; end-to-end; speech recognition;
DOI:
10.21437/Interspeech.2023-1272
CLC classification number:
O42 [Acoustics]
Subject classification codes:
070206; 082403
Abstract:
One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance is compromised when train and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances of short-video spontaneous speech tend to be much shorter (approximately 3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (approximately 10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, we observe that the proposed RUC method significantly improves long-utterance recognition without degrading performance on short utterances. Overall, it achieves a 5.72% average word error rate reduction across 15 languages and improved robustness to varying utterance lengths.
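To make the method concrete, below is a minimal Python sketch of on-the-fly random utterance concatenation, written from the abstract's description alone. Everything in it is an illustrative assumption rather than the paper's actual recipe: utterances are modeled as (audio, transcript, duration) triples, and the names and parameters (ruc_batch, concat_prob, max_duration) as well as the coin-flip concatenation policy are hypothetical.

import random

# Minimal sketch of on-the-fly random utterance concatenation (RUC).
# Utterances are (audio, transcript, duration_seconds) triples; the
# parameters below are illustrative assumptions, not the paper's values.
def ruc_batch(utterances, concat_prob=0.5, max_duration=10.0, seed=None):
    rng = random.Random(seed)
    pool = list(utterances)
    rng.shuffle(pool)  # random pairing is what makes the concatenation "random"

    augmented = []
    i = 0
    while i < len(pool):
        audio, text, dur = pool[i]
        i += 1
        # Keep appending random neighbours while a coin flip succeeds and
        # the running duration stays under the target test-length cap.
        while (i < len(pool) and rng.random() < concat_prob
               and dur + pool[i][2] <= max_duration):
            nxt_audio, nxt_text, nxt_dur = pool[i]
            i += 1
            audio = audio + nxt_audio      # waveforms appended end-to-end
            text = text + " " + nxt_text   # transcripts joined with a space
            dur += nxt_dur
        augmented.append((audio, text, dur))
    return augmented

# Usage: three ~3 s clips may be merged into one ~9 s training sample,
# closer to the ~10 s utterances produced by the VAD front-end at test time.
batch = [([0.0] * 48000, "hello world", 3.0),
         ([0.1] * 48000, "short video", 3.0),
         ([0.2] * 48000, "speech recognition", 3.0)]
longer = ruc_batch(batch, concat_prob=1.0, seed=0)
print([(round(d, 1), t) for _, t, d in longer])

Capping the running duration near the test-time average keeps augmented samples in the target length range, while the probabilistic concatenation leaves some unmodified short utterances in the training pool, which is consistent with the reported result that short-utterance performance does not degrade.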
Pages: 904-908
Number of pages: 5
Related papers
50 records in total
  • [21] Random Concatenation: A Simple Data Augmentation Method for Neural Machine Translation
    Xiao, Nini
    Zhang, Huaao
    Jin, Chang
    Duan, Xiangyu
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT I, 2022, 13551 : 69 - 80
  • [22] Action Video Recognition Framework based on NetVLAD with Data Augmentation
    Wang, Fa-fa
    Kong, Jian-lei
    Peng, Shi-yu
    Jin, Xue-bo
    Su, Ting-li
    Bai, Yu-ting
    2018 CHINESE AUTOMATION CONGRESS (CAC), 2018, : 1986 - 1991
  • [23] Improving Transformer-based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation
    Li, Sheng
    Dabre, Raj
    Lu, Xugang
    Shen, Peng
    Kawahara, Tatsuya
    Kawai, Hisashi
    INTERSPEECH 2019, 2019, : 4400 - 4404
  • [24] Improving Short Utterance based I-vector Speaker Recognition using Source and Utterance-Duration Normalization Techniques
    Kanagasundaram, A.
    Dean, D.
    Gonzalez-Dominguez, J.
    Sridharan, S.
    Ramos, D.
    Gonzalez-Rodriguez, J.
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2464 - 2468
  • [25] Improving Children's Speech Recognition through Out-of-Domain Data Augmentation
    Fainberg, Joachim
    Bell, Peter
    Lincoln, Mike
    Renals, Steve
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1598 - 1602
  • [26] IMPROVING SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION TRAINING WITH ON-THE-FLY DATA AUGMENTATION
    Nguyen, Thai-Son
    Stüker, Sebastian
    Niehues, Jan
    Waibel, Alex
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7689 - 7693
  • [27] Improving Diacritical Arabic Speech Recognition: Transformer-Based Models with Transfer Learning and Hybrid Data Augmentation
    Alaqel, Haifa
    El Hindi, Khalil
    INFORMATION (SWITZERLAND), 2025, 16 (03)
  • [28] Improving transformer-based speech recognition performance using data augmentation by local frame rate changes
    Lim, Seong Su
    Kang, Byung Ok
    Kwon, Oh-Wook
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2022, 41 (02): : 122 - 129
  • [29] Speech Emotion Recognition Using Data Augmentation
    Kapoor, Tanisha
    Ganguly, Arnaja
    Rajeswari, D.
    2024 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATION AND APPLIED INFORMATICS, ACCAI 2024, 2024,
  • [30] Speech emotion recognition using data augmentation
    Praseetha, V. M.
    Joby, P. P.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2022, 25 : 783 - 792