A parallel hybrid web document clustering algorithm and its performance study

Cited by: 16
Authors
Xu, ST [1]
Zhang, J [1]
Affiliations
[1] Univ Kentucky, Dept Comp Sci, Lab High Performance Sci Comp & Comp Simulat, Lexington, KY 40506 USA
Source
JOURNAL OF SUPERCOMPUTING | 2004 / Vol. 30 / Issue 02
Keywords
information retrieval; parallel document clustering; PDDP; K-means
DOI
10.1023/B:SUPE.0000040611.25862.d9
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Clustering web documents is an important procedure in many web information retrieval systems. As the Internet grows rapidly and the number of information requests increases exponentially, parallel computing techniques become unavoidable in large-scale web document retrieval. We propose a parallel hybrid web document clustering algorithm that combines the Principal Direction Divisive Partitioning (PDDP) algorithm with the K-means algorithm. Computational experiments were conducted to test the performance of the hybrid algorithm on three real-life web document datasets, and the results were compared with those of the parallel PDDP algorithm and the parallel K-means algorithm. The experiments show that the quality of the clustering solutions obtained from the hybrid algorithm is better than that of the solutions from parallel PDDP or parallel K-means. The parallel run time of the hybrid algorithm is similar to, and sometimes less than, that of the widely used K-means algorithm.
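The abstract does not say how PDDP and K-means are combined, so the following is only a minimal sequential sketch under one common assumption: PDDP produces an initial partition, and the centroids of that partition seed K-means refinement. It is not the authors' parallel implementation. All function names (`pddp`, `kmeans`, `hybrid_cluster`) and the toy vectors are hypothetical, and the principal direction is approximated by power iteration rather than a full SVD.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def scatter(rows):
    """Total squared distance of the rows from their centroid."""
    c = centroid(rows)
    return sum(sqdist(r, c) for r in rows)

def principal_direction(rows, iters=100):
    """Approximate the leading principal direction of the centered rows
    by power iteration on A^T A (assumption: stands in for an SVD)."""
    c = centroid(rows)
    a = [[x - m for x, m in zip(r, c)] for r in rows]
    rng = random.Random(0)
    v = [rng.random() + 0.1 for _ in c]
    for _ in range(iters):
        proj = [sum(x * y for x, y in zip(r, v)) for r in a]               # A v
        w = [sum(p * r[j] for p, r in zip(proj, a)) for j in range(len(c))]  # A^T (A v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return c, v

def pddp(rows, k):
    """Divisive step: repeatedly split the cluster with the largest scatter
    by the sign of each row's projection onto the principal direction."""
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        candidates = [cl for cl in clusters if len(cl) > 1]
        if not candidates:
            break
        target = max(candidates, key=lambda cl: scatter([rows[i] for i in cl]))
        c, v = principal_direction([rows[i] for i in target])
        left, right = [], []
        for i in target:
            proj = sum((x - m) * d for x, m, d in zip(rows[i], c, v))
            (left if proj <= 0 else right).append(i)
        if not left or not right:
            break  # cluster cannot be split further
        clusters.remove(target)
        clusters += [left, right]
    return clusters

def kmeans(rows, centers, iters=20):
    """Plain Lloyd iterations starting from the given centers."""
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda j: sqdist(r, centers[j]))
                  for r in rows]
        new_centers = []
        for j in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def hybrid_cluster(rows, k):
    """Assumed hybrid: PDDP partition first, then K-means seeded by its centroids."""
    parts = pddp(rows, k)
    seeds = [centroid([rows[i] for i in p]) for p in parts]
    return kmeans(rows, seeds)

# Toy stand-ins for document term vectors (hypothetical data).
docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = hybrid_cluster(docs, 2)
```

Seeding K-means from a deterministic divisive partition removes its dependence on random initial centers, which is one plausible reason a hybrid could yield better clusters than either method alone. In a parallel setting, the per-row projections and nearest-center assignments are the natural steps to distribute across processors.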
Pages: 117-131
Page count: 15