OPTIMIZATION OF TEXT DATABASE USING HIERARCHICAL CLUSTERING

Cited by: 2
Authors
Tian, Jilei [1]
Nurminen, Jani [2]
Affiliations
[1] Nokia Res Ctr, Media Lab, Tampere, Finland
[2] Nokia, Devices R&D, Tampere, Finland
Keywords
hierarchical clustering; Levenshtein distance; text data selection
DOI
10.1109/ICASSP.2009.4960572
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Many speech and language related techniques employ models that are trained using text data. In this paper, we introduce a novel method for selecting optimized training sets from text databases. The coverage of the subset selected for training is optimized using hierarchical clustering and the generalized Levenshtein distance. The validity of the proposed subset optimization technique is verified in a data-driven syllabification task. The results clearly indicate that the proposed approach meaningfully optimizes the training set, which in turn improves the quality of the trained model. Compared to the existing state-of-the-art data selection technique, the proposed hierarchical clustering approach improves the compactness of data clusters, decreases the computational complexity and makes data set selection scalable. The presented idea can be used in a wide variety of language processing applications that require training with text data.
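The general idea in the abstract (cluster the text items by edit distance, then keep a small representative subset) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the selection rule, the linkage criterion, or the cost weights of the generalized Levenshtein distance. The sketch assumes unit edit costs, average linkage, and one medoid kept per cluster. The function names `levenshtein`, `cluster`, `medoid`, and `select_subset` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (assumption: the paper's generalized
    variant may weight operations differently)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def cluster(words, k):
    """Agglomerative clustering with average linkage (an assumed choice),
    merging the closest pair of clusters until k clusters remain."""
    dist = {frozenset(p): levenshtein(*p) for p in combinations(words, 2)}
    clusters = [[w] for w in words]

    def linkage(c1, c2):
        total = sum(dist[frozenset((u, v))] for u in c1 for v in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def medoid(members):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(members, key=lambda w: sum(levenshtein(w, o) for o in members))


def select_subset(words, k):
    """Pick one representative per cluster (hypothetical selection rule)."""
    unique = list(dict.fromkeys(words))  # drop duplicates, keep order
    return [medoid(c) for c in cluster(unique, k)]


if __name__ == "__main__":
    vocabulary = ["cat", "cats", "cast", "dog", "dogs", "dig"]
    print(select_subset(vocabulary, 2))
```

The sketch recomputes linkages from scratch at every merge, so it is only suitable for small word lists. It makes no attempt to reproduce the complexity or scalability improvements the abstract reports.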
Pages: 4269 / +
Number of pages: 2