Efficient text document clustering with new similarity measures

Cited by: 9
Authors
Lakshmi R. [1]
Baskar S. [2]
Affiliations
[1] Department of Computer Science and Engineering, K.L.N. College of Engineering, Sivagangai District, Tamilnadu
[2] Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, Madurai, Tamilnadu
Keywords
Accuracy; Document clustering; Entropy; F-measure; Recall; Similarity measures
DOI
10.1504/IJBIDM.2021.111741
Abstract
This paper proposes two new similarity measures, the distance of term frequency-based similarity measure (DTFSM) and the presence of common terms-based similarity measure (PCTSM), to compute the similarity between two documents and thereby improve the effectiveness of text document clustering. The proposed measures are evaluated on the Reuters-21578 and WebKB datasets, with documents clustered using the K-means and K-means++ algorithms. The results obtained with DTFSM and PCTSM are significantly better than those of other measures in terms of accuracy, entropy, recall and F-measure. The proposed measures not only improve the effectiveness of text document clustering but also reduce the complexity of the similarity computation, measured by the number of operations required during clustering. Copyright © 2021 Inderscience Enterprises Ltd.
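Of the evaluation criteria named in the abstract, entropy is the least self-explanatory. The following is a minimal sketch of one common definition (the size-weighted average of the class-label entropy inside each cluster, in bits); it is an assumption that the paper uses this exact form, and the function name `cluster_entropy` is hypothetical. The formulas for DTFSM and PCTSM are not given in this record, so they are not reproduced here.

```python
import math
from collections import Counter


def cluster_entropy(labels, clusters):
    """Size-weighted average entropy of true labels within each cluster.

    labels   -- true class label of each document
    clusters -- cluster id assigned to each document (same order)
    Lower is better; 0.0 means every cluster contains a single class.
    Assumed definition, not taken from the paper itself.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")

    # Group the true labels by assigned cluster.
    members_by_cluster = {}
    for label, cluster in zip(labels, clusters):
        members_by_cluster.setdefault(cluster, []).append(label)

    n = len(labels)
    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        counts = Counter(members)
        # Shannon entropy (base 2) of the label distribution in this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        # Weight each cluster by the fraction of documents it holds.
        total += (size / n) * h
    return total
```

Some authors additionally divide by the logarithm of the number of classes so that the score lies in [0, 1]; that normalisation is left out of this sketch.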
Pages: 109-126
Number of pages: 17