Efficient text document clustering with new similarity measures

Cited by: 9
Authors
Lakshmi R. [1 ]
Baskar S. [2 ]
Affiliations
[1] Department of Computer Science and Engineering, K.L.N. College of Engineering, Sivagangai District, Tamilnadu
[2] Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, Madurai, Tamilnadu
Keywords
Accuracy; Document clustering; Entropy; F-measure; Recall; Similarity measures
DOI
10.1504/IJBIDM.2021.111741
Abstract
In this paper, two new similarity measures are proposed for computing the similarity between two documents, with the aim of improving the effectiveness of text document clustering: the distance of term frequency-based similarity measure (DTFSM) and the presence of common terms-based similarity measure (PCTSM). The proposed measures are evaluated on the Reuters-21578 and WebKB datasets, with the documents clustered using the K-means and K-means++ algorithms. The results obtained with DTFSM and PCTSM are significantly better than those of other measures for document clustering in terms of accuracy, entropy, recall and F-measure. The proposed measures not only improve the effectiveness of text document clustering but also reduce the complexity of the similarity computation, as measured by the number of operations required during clustering. Copyright © 2021 Inderscience Enterprises Ltd.
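The abstract names entropy and F-measure as evaluation criteria but does not give their formulas, and it does not define DTFSM or PCTSM. As an illustration only, the sketch below computes two commonly used textbook variants of these clustering metrics from true class labels and cluster assignments. The function names `cluster_entropy` and `clustering_f_measure` are hypothetical, and the paper may define its metrics differently.

```python
import math
from collections import Counter


def cluster_entropy(labels, clusters):
    """Size-weighted average entropy of the true labels inside each cluster.

    Lower is better. This is a common textbook definition, assumed here;
    it is not taken from the paper.
    """
    n = len(labels)
    members = {}
    for label, cluster in zip(labels, clusters):
        members.setdefault(cluster, []).append(label)

    total = 0.0
    for group in members.values():
        counts = Counter(group)
        size = len(group)
        # Shannon entropy (base 2) of the label distribution in this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F1 of each class over all clusters.

    Higher is better. Again an assumed, commonly used definition.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    score = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = joint[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_cls / n) * best
    return score


# Toy usage: four documents, two true classes, two candidate clusterings.
true_labels = ["earn", "earn", "grain", "grain"]
good_clusters = [0, 0, 1, 1]   # each cluster holds a single class
mixed_clusters = [0, 1, 0, 1]  # each cluster mixes both classes
print(cluster_entropy(true_labels, good_clusters),
      clustering_f_measure(true_labels, good_clusters))
print(cluster_entropy(true_labels, mixed_clusters),
      clustering_f_measure(true_labels, mixed_clusters))
```

The cluster assignments would come from whatever clustering step is used, for example K-means or K-means++ over a document similarity matrix; that step is left out here because the abstract does not specify how the proposed measures are plugged into the algorithms.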
Pages: 109 - 126
Page count: 17
Related papers
50 in total
  • [21] Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering
    Kadhim, Ammar Ismael
    Cheah, Yu-N
    Ahamed, Nurul Hashimah
    PROCEEDINGS 2014 4TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE WITH APPLICATIONS IN ENGINEERING AND TECHNOLOGY ICAIET 2014, 2014, : 69 - 73
  • [22] Text document clustering and the space of concept on text document automatically generated
    Fu, WP
    Wu, B
    He, Q
    Shi, ZZ
    2001 INTERNATIONAL CONFERENCES ON INFO-TECH AND INFO-NET PROCEEDINGS, CONFERENCE A-G: INFO-TECH & INFO-NET: A KEY TO BETTER LIFE, 2001, : C107 - C112
  • [23] An improved Document Clustering Approach with Multi-Viewpoint based on different similarity measures
    Gupta, Anjali
    Dubey, Rahul
    PROCEEDINGS OF THE 2018 SECOND INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND CONTROL SYSTEMS (ICICCS), 2018, : 152 - 157
  • [24] Similarity Measures for Spatial Clustering
    Hamdad, Leila
    Benatchba, Karima
    Ifrez, Soraya
    Mohguen, Yasmine
    COMPUTATIONAL INTELLIGENCE AND ITS APPLICATIONS, 2018, 522 : 25 - 36
  • [25] Document Clustering Based on Fuzzy Similarity
    Zhou, Jingli
    Nie, Xuejun
    Qin, Leihua
    Zhu, Jianfeng
    APPLIED MECHANICS AND MECHANICAL ENGINEERING, PTS 1-3, 2010, 29-32 : 2620 - 2626
  • [26] Improving Suffix Tree Clustering with New Ranking and Similarity Measures
    Worawitphinyo, Phiradit
    Gao, Xiaoying
    Jabeen, Shahida
    ADVANCED DATA MINING AND APPLICATIONS, PT II, 2011, 7121 : 55 - 68
  • [27] Preprocessing method and similarity measures in clustering-based text mining: a preliminary study
    Iiritano, S
    Ruffolo, M
    Rullo, P
    DATA MINING IV, 2004, 7 : 73 - 79
  • [28] TIERED CITATION AND MEASURES OF DOCUMENT SIMILARITY
    CRONIN, B
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 1994, 45 (07): : 537 - 538
  • [29] Centrality Measures for Text Clustering
    Iezzi, Domenica Fioredistella
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2012, 41 (16-17) : 3179 - 3197
  • [30] Text clustering based on asymmetric similarity
    School of Software, Tsinghua University, Beijing 100084, China
    Qinghua Daxue Xuebao, 2006, 7 (1325-1328):