OPTIMIZATION OF TEXT DATABASE USING HIERARCHICAL CLUSTERING

Cited by: 2
Authors
Tian, Jilei [1]
Nurminen, Jani [2]
Affiliations
[1] Nokia Res Ctr, Media Lab, Tampere, Finland
[2] Nokia, Devices R&D, Tampere, Finland
Keywords
hierarchical clustering; Levenshtein distance; text data selection
DOI
10.1109/ICASSP.2009.4960572
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Many speech and language related techniques employ models that are trained using text data. In this paper, we introduce a novel method for selecting optimized training sets from text databases. The coverage of the subset selected for training is optimized using hierarchical clustering and the generalized Levenshtein distance. The validity of the proposed subset optimization technique is verified in a data-driven syllabification task. The results clearly indicate that the proposed approach meaningfully optimizes the training set, which in turn improves the quality of the trained model. Compared to the existing state-of-the-art data selection technique, the proposed hierarchical clustering approach improves the compactness of data clusters, decreases the computational complexity and makes data set selection scalable. The presented idea can be used in a wide variety of language processing applications that require training with text data.
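The abstract names two ingredients, hierarchical clustering and a generalized Levenshtein distance, without giving details. The following is a minimal illustrative sketch of how such a selection could work, not the authors' algorithm: the cost parameters `sub_cost` and `indel_cost`, the average-linkage merge rule, and the choice of one medoid per cluster as the selected subset are all assumptions made for this example, and every function name is hypothetical.

```python
from itertools import combinations


def levenshtein(a, b, sub_cost=1.0, indel_cost=1.0):
    """Edit distance with configurable costs (an assumed 'generalized' form)."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                            # delete ca
                curr[j - 1] + indel_cost,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def agglomerate(words, k):
    """Average-linkage agglomerative clustering until k clusters remain."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    dist = {frozenset(p): levenshtein(*p) for p in combinations(words, 2)}
    clusters = [[w] for w in words]

    def linkage(c1, c2):
        total = sum(dist[frozenset((u, v))] for u in c1 for v in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def medoid(cluster):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda w: sum(levenshtein(w, o) for o in cluster))


def select_subset(words, k):
    """Pick one representative per cluster as the training subset."""
    return sorted(medoid(c) for c in agglomerate(words, k))
```

Calling `select_subset(["cat", "cats", "bat", "house", "houses", "mouse"], 2)` groups the short and long words separately and returns one representative from each group. The pairwise distance table makes this sketch quadratic in the number of entries, so it is only suitable for small word lists.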
Pages: 4269 / +
Page count: 2
Related papers
50 in total
  • [31] Text clustering using VSM with feature clusters
    Cao Qimin
    Guo Qiao
    Wang Yongliang
    Wu Xianghua
    NEURAL COMPUTING & APPLICATIONS, 2015, 26 (04): : 995 - 1003
  • [32] Text Clustering Using Statistical and Semantic Data
    Benghabrit, Asmaa
    Ouhbi, Brahim
    Behja, Hicham
    Frikh, Bouchra
    WORLD CONGRESS ON COMPUTER & INFORMATION TECHNOLOGY (WCCIT 2013), 2013,
  • [33] Image clustering using generated text centroids
    Kong, Daehyeon
    Kong, Kyeongbo
    Kang, Suk-Ju
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2024, 125
  • [34] Clustering legal artifacts using text mining
    Lachana, Zoi
    Loutsaris, Michalis Avgerinos
    Alexopoulos, Charalampos
    Charalabidis, Yannis
    14TH INTERNATIONAL CONFERENCE ON THEORY AND PRACTICE OF ELECTRONIC GOVERNANCE (ICEGOV 2021), 2021, : 65 - 70
  • [35] Text Clustering Using Novel Hybrid Algorithm
    Dev, Divya D.
    Jebaruby, Merlin
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT 1, 2014, 8397 : 11 - 20
  • [36] Deep text clustering using stacked AutoEncoder
    Soodeh Hosseini
    Zahra Asghari Varzaneh
    Multimedia Tools and Applications, 2022, 81 : 10861 - 10881
  • [37] An Efficient Text Classification Scheme Using Clustering
    Thomas, Anisha Mariam
    Resmipriya, M. G.
    INTERNATIONAL CONFERENCE ON EMERGING TRENDS IN ENGINEERING, SCIENCE AND TECHNOLOGY (ICETEST - 2015), 2016, 24 : 1220 - 1225
  • [38] Evaluation of Text Clustering Methods Using WordNet
    Amine, Abdelmalek
    Elberrichi, Zakaria
    Simonet, Michel
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2010, 7 (04) : 349 - 357
  • [39] Text clustering using VSM with feature clusters
    Cao Qimin
    Guo Qiao
    Wang Yongliang
    Wu Xianghua
    Neural Computing and Applications, 2015, 26 : 995 - 1003
  • [40] Text Classification using Clustering Techniques and PCA
    Kaur, Manpreet
    Bansal, Meenakshi
    2016 FOURTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2016, : 642 - 646