A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web

Cited by: 0
Authors
Parada, Carolina [1]
Sethy, Abhinav [2]
Dredze, Mark [1]
Jelinek, Frederick [1]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Human Language Technol Ctr Excellence, 3400 N Charles St, Baltimore, MD 21210 USA
[2] IBM TJ Watson Res Ctr, New York, NY 10598 USA
Keywords
language modeling; data selection; spoken term detection; OOV detection
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Vocabulary restrictions in large-vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information-rich terms (often named entities), and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into the system output, recovering up to 40% of OOVs and reducing system error.
Pages: 1269 / +
Page count: 2