Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need

Cited by: 0
Authors
Kuzdeuov, Askat [1]
Nurgaliyev, Shakhizat [1]
Turmakhan, Diana [1]
Laiyk, Nurkhan [1]
Varol, Huseyin Atakan [1]
Affiliations
[1] Nazarbayev Univ, Inst Smart Syst & AI, Astana, Kazakhstan
Keywords
Speech commands recognition; text-to-speech; Kazakh Speech Corpus; voice commands; data-centric AI
DOI
10.1109/RAAI59955.2023.10601292
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Speech Command Recognition (SCR) is rapidly gaining prominence due to its diverse applications, such as virtual assistants, smart homes, hands-free navigation, and voice-controlled industrial machinery. In this paper, we present a data-centric approach to creating SCR systems for low-resource languages, focusing on the Kazakh language. By leveraging synthetic data generated by Text-to-Speech (TTS) and data extracted from a large-scale speech corpus, we created the Kazakh-language equivalent of the Google Speech Commands dataset. We also compiled the Kazakh Speech Commands dataset from data collected from 119 participants. This dataset was used to benchmark the performance of the Keyword-MLP model trained on our synthetic dataset. The model achieved 89.79% accuracy on the real-world data, demonstrating the efficacy of our approach. Our work can serve as a recipe for creating customized speech command datasets, including for low-resource languages, obviating the need for laborious and costly human data collection.
Pages: 286-291
Number of pages: 6