A study of unsupervised clustering techniques for language modeling

Cited by: 0
Authors
Hahn, Sangyun [1 ]
Sethy, Abhinav [2 ]
Kuo, Hong-Kwang J. [2 ]
Ramabhadran, Bhuvana [2 ]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
[2] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Source
INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5 | 2008
Keywords
Clustering; Language Model Adaptation; Entropy;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
There has been recent interest in clustering text data to build topic-specific language models for large vocabulary speech recognition. In this paper, we studied various unsupervised clustering algorithms on several corpora. First we compared the clustering methods with quality metrics such as entropy and purity. Of the techniques studied, two-phase bisecting K-means achieved good performance with relatively fast speed. Then we performed speech recognition experiments on English and Arabic systems using the automatically derived topic-based language models. We obtained modest word error rate improvements, comparable to previously published studies. A careful analysis of the correlation between word error rate and the distribution of misrecognized words, including an information-gain metric, is presented.
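The abstract names entropy and purity as the clustering quality metrics. As an illustration only, the sketch below computes the commonly used textbook definitions of both against reference topic labels. This record does not give the paper's exact formulas, so the definitions here are an assumption, and the function name `purity_and_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, topic_labels):
    """Return (purity, entropy) of a clustering against reference labels.

    Assumed textbook definitions (not taken from the paper):
      purity  = sum over clusters of (size of the majority label) / N
      entropy = size-weighted average of each cluster's label entropy, in bits
    """
    # Count how often each reference label occurs inside each cluster.
    clusters = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, topic_labels):
        clusters[cluster][label] += 1

    n = len(cluster_ids)
    purity = 0.0
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        # The majority label's share of all documents contributes to purity.
        purity += max(counts.values()) / n
        # Shannon entropy of the label distribution within this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy
```

Under these definitions, higher purity and lower entropy indicate clusters that align more closely with the reference topics.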
Pages: 1598 / +
Page count: 2