Predicting the out-of-vocabulary rate and the required vocabulary size for speech processing applications

Cited by: 0
Authors
Muller, J
Stahl, H
Lang, M
Keywords
out-of-vocabulary rate; OOV-rate; vocabulary size; text corpus; test corpus; training corpus;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper describes an approach for predicting both the vocabulary size and the resulting out-of-vocabulary rate (OOV-rate) for a hypothetical extension of an existing text corpus. By splitting the original corpus into two different sub-corpora, the vocabulary and the OOV-rate can be determined for that particular split. Average values are calculated over all combinations of sub-corpora and can be approximated by analytic functions. These functions enable easy prediction of the vocabulary size and the OOV-rate. The predictions have a relative error below 4.6%.
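The splitting-and-averaging step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record does not give the paper's partitioning scheme or the analytic functions it fits, so the block-wise partition and the names `vocab_and_oov`, `average_over_splits`, `blocks`, and `n_train` are all assumptions.

```python
from itertools import combinations


def vocab_and_oov(train_tokens, test_tokens):
    """Return (vocabulary size of the training part, OOV rate of the test part)."""
    vocab = set(train_tokens)
    if not test_tokens:
        return len(vocab), 0.0
    # A test token is out-of-vocabulary if it never occurs in the training part.
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return len(vocab), oov / len(test_tokens)


def average_over_splits(blocks, n_train):
    """Average vocabulary size and OOV rate over every choice of n_train blocks.

    Assumption: the corpus is pre-cut into equal-sized blocks (hypothetical);
    the chosen blocks form the training sub-corpus, the rest the test sub-corpus.
    """
    sizes, rates = [], []
    for train_idx in combinations(range(len(blocks)), n_train):
        chosen = set(train_idx)
        train = [tok for i in train_idx for tok in blocks[i]]
        test = [tok for i, block in enumerate(blocks) if i not in chosen for tok in block]
        size, rate = vocab_and_oov(train, test)
        sizes.append(size)
        rates.append(rate)
    return sum(sizes) / len(sizes), sum(rates) / len(rates)


# Toy corpus of three blocks; train on any two, test on the remaining one.
blocks = [["a", "b", "a"], ["b", "c"], ["a", "d", "d"]]
avg_size, avg_rate = average_over_splits(blocks, 2)
print(avg_size, avg_rate)
```

Repeating this for each value of `n_train` gives one averaged point per training size. Curves fitted through such points could then be extrapolated beyond the existing corpus, which is the prediction step the abstract refers to.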
Pages: 1922-1925
Page count: 4
Related papers
50 in total
  • [41] Robust Backed-off Estimation of Out-of-Vocabulary Embeddings
    Fukuda, Nobukazu
    Yoshinaga, Naoki
    Kitsuregawa, Masaru
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 4827 - 4838
  • [42] Variable-Span Out-of-Vocabulary Named Entity Detection
    Chen, Wei
    Ananthakrishnan, Sankaranarayanan
    Prasad, Rohit
    Natarajan, Prem
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 3728 - 3732
  • [43] Out-of-vocabulary word modeling using multiple lexical fillers
    Boulianne, G
    Dumouchel, P
    ASRU 2001: IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, CONFERENCE PROCEEDINGS, 2001, : 226 - 229
  • [44] Term-Dependent Confidence for Out-of-Vocabulary Term Detection
    Wang, Dong
    King, Simon
    Frankel, Joe
    Bell, Peter
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2103 - 2106
  • [45] Few-Shot Representation Learning for Out-Of-Vocabulary Words
    Hu, Ziniu
    Chen, Ting
    Chang, Kai-Wei
    Sun, Yizhou
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 4102 - 4112
  • [46] Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds
    Zhang, Yu
    Meng, Yu
    Wang, Xuan
    Wang, Sheng
    Han, Jiawei
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 279 - 290
  • [47] Multi-level out-of-vocabulary words handling approach
    Lochter, Johannes V.
    Silva, Renato M.
    Almeida, Tiago A.
    KNOWLEDGE-BASED SYSTEMS, 2022, 251
  • [48] Recurrent out-of-vocabulary word detection based on distribution of features
    Asami, Taichi
    Masumura, Ryo
    Aono, Yushi
    Shinoda, Koichi
    COMPUTER SPEECH AND LANGUAGE, 2019, 58 : 247 - 259
  • [49] Rejection of out-of-vocabulary words using phoneme confidence likelihood
    Jitsuhiro, T
    Takahashi, S
    Aikawa, K
    PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 217 - 220
  • [50] Exploiting Out-of-Vocabulary Words for Out-of-Domain Detection in Dialog Systems
    Ryu, Seonghan
    Lee, Donghyeon
    Lee, Gary Geunbae
    Kim, Kyungduk
    Noh, Hyungjong
    2014 INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2014, : 165 - +