Predicting the out-of-vocabulary rate and the required vocabulary size for speech processing applications

Cited by: 0
Authors
Muller, J
Stahl, H
Lang, M
Keywords
out-of-vocabulary rate; OOV-rate; vocabulary size; text corpus; test corpus; training corpus;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper describes an approach for predicting both the vocabulary size and the resulting out-of-vocabulary rate (OOV-rate) for a hypothetical extension of an existing text corpus. The original corpus is split into two different sub-corpora, and the vocabulary size and OOV-rate are determined for that particular split. Average values are calculated over all combinations of sub-corpora and can be approximated by analytic functions. These functions make it easy to predict the vocabulary size and the OOV-rate. The predictions have a relative error below 4.6%.
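The splitting-and-averaging step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the corpus is partitioned or which analytic functions are fitted, so the sketch assumes the corpus is already divided into equally treated parts, and the names `oov_rate` and `average_over_splits` are hypothetical. It covers only the measurement step (vocabulary size and OOV-rate per split, averaged over all combinations), not the curve fitting.

```python
from itertools import combinations


def oov_rate(train_tokens, test_tokens):
    """Return (vocabulary size, OOV-rate) for one training/test split."""
    vocab = set(train_tokens)
    if not test_tokens:
        return len(vocab), 0.0
    # A test token is out-of-vocabulary if it never occurs in the training part.
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return len(vocab), oov / len(test_tokens)


def average_over_splits(parts, n_train):
    """Average vocabulary size and OOV-rate over all ways of choosing
    n_train parts as the training sub-corpus (assumed partitioning scheme);
    the remaining parts form the test sub-corpus."""
    sizes, rates = [], []
    for train_idx in combinations(range(len(parts)), n_train):
        train = [t for i in train_idx for t in parts[i]]
        test = [t for i in range(len(parts)) if i not in train_idx
                for t in parts[i]]
        size, rate = oov_rate(train, test)
        sizes.append(size)
        rates.append(rate)
    return sum(sizes) / len(sizes), sum(rates) / len(rates)
```

Calling `average_over_splits` for each value of `n_train` would give one averaged point per training-corpus size; a function fitted to such points is what the abstract describes as the basis for extrapolating to a larger corpus.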
Pages: 1922 - 1925
Number of pages: 4