A study of unsupervised clustering techniques for language modeling

Cited by: 0
Authors
Hahn, Sangyun [1]
Sethy, Abhinav [2]
Kuo, Hong-Kwang J. [2]
Ramabhadran, Bhuvana [2]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
[2] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Source
INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5 | 2008
Keywords
Clustering; Language Model Adaptation; Entropy
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
There has been recent interest in clustering text data to build topic-specific language models for large vocabulary speech recognition. In this paper, we studied various unsupervised clustering algorithms on several corpora. First we compared the clustering methods with quality metrics such as entropy and purity. Of the techniques studied, two-phase bisecting K-means achieved good performance with relatively fast speed. Then we performed speech recognition experiments on English and Arabic systems using the automatically derived topic-based language models. We obtained modest word error rate improvements, comparable to previously published studies. A careful analysis of the correlation between word error rate and the distribution of misrecognized words, including an information-gain metric, is presented.
Pages: 1598 / +
Number of pages: 2