OPTIMIZATION OF TEXT DATABASE USING HIERARCHICAL CLUSTERING

Cited by: 2
Authors
Tian, Jilei [1]
Nurminen, Jani [2]
Affiliations
[1] Nokia Res Ctr, Media Lab, Tampere, Finland
[2] Nokia, Devices R&D, Tampere, Finland
Keywords
hierarchical clustering; Levenshtein distance; text data selection;
DOI
10.1109/ICASSP.2009.4960572
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Many speech and language related techniques employ models that are trained using text data. In this paper, we introduce a novel method for selecting optimized training sets from text databases. The coverage of the subset selected for training is optimized using hierarchical clustering and the generalized Levenshtein distance. The validity of the proposed subset optimization technique is verified in a data-driven syllabification task. The results clearly indicate that the proposed approach meaningfully optimizes the training set, which in turn improves the quality of the trained model. Compared to the existing state-of-the-art data selection technique, the proposed hierarchical clustering approach improves the compactness of data clusters, decreases the computational complexity and makes data set selection scalable. The presented idea can be used in a wide variety of language processing applications that require training with text data.
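The selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the cost scheme of the generalized Levenshtein distance, the linkage criterion, or how a representative is picked from each cluster, so the uniform edit costs, the average linkage, the medoid selection, and the names `levenshtein`, `average_linkage`, and `select_subset` below are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Weighted edit distance between strings a and b.

    The per-operation costs are illustrative assumptions; the paper's
    generalized distance may weight operations differently.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                              # delete ca
                curr[j - 1] + ins_cost,                          # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between two clusters of item indices."""
    return sum(dist[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))


def select_subset(words, k):
    """Merge clusters bottom-up until k remain, then return one medoid per cluster."""
    n = len(words)
    dist = [[levenshtein(words[i], words[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Find the closest pair of clusters (a < b by construction) and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], dist),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # The medoid is the member with the smallest total distance to its cluster.
    return [
        words[min(c, key=lambda i: sum(dist[i][j] for j in c))]
        for c in clusters
    ]
```

For example, `select_subset(["cat", "cats", "dog", "dogs"], 2)` returns one representative from the "cat" group and one from the "dog" group. This naive version recomputes linkages at every merge and is meant only to show the structure of the approach, not to reflect the scalability claims in the abstract.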
Pages: 4269 / +
Number of pages: 2